RNAs are vital molecules that carry genetic information essential for life, with significant implications for drug development and biotechnology. However, RNA research is often slowed by the vast amount of literature. To address this, we introduce RNA-GPT, a multi-modal RNA chat model that simplifies RNA discovery by leveraging extensive RNA literature.
RNA-GPT combines RNA sequence encoders with linear projection layers and state-of-the-art large language models (LLMs) for precise representation alignment. This enables it to process user-uploaded RNA sequences and provide concise, accurate responses. Our scalable training pipeline, powered by RNA-QA, automatically gathers RNA annotations from RNAcentral using a divide-and-conquer approach with GPT-4o and latent Dirichlet allocation (LDA) to handle large datasets and generate instruction-tuning samples.
Experiments show that RNA-GPT effectively handles complex RNA queries, streamlining RNA research. We also introduce RNA-QA, a dataset of 407,616 RNA sequences for modality alignment and instruction tuning.
Large language models (LLMs) trained on internet-scale corpora have been shown to perform extraordinarily well on a wide array of tasks, from Olympiad-level mathematical and scientific reasoning to planning long-horizon tasks for robotic systems. Recent advances in the biological and medical fields have enabled the adaptation of these powerful models to accelerate research, significantly reducing reliance on traditional experiments.
Because proteins, RNAs, and DNAs can be represented as character strings, and a vast amount of sequenced data is readily available, an ideal environment has emerged for training language models to predict and generate protein, DNA, and RNA structures and sequences. Protein language models like ESM have successfully encoded protein sequence and structure information, inspiring works such as ProteinGPT and ProtST, which adapt protein representations into a language-based format and enable natural language querying of protein data.
Much like proteins, which are written as strings of amino-acid characters, RNAs are sequences drawn from four unique nucleotides, and this representation has sparked similar interest in computational RNA and DNA research with LLMs. Following the example of ESM-2, works like RiNALMo and RNA-FM have utilized the flexible capabilities of language models to learn and predict RNA structure and function.
While models like ProteinGPT, ProtST, ProteinChat, and ProteinCLIP have made significant progress in aligning protein sequences and structures with textual descriptions, the DNA and RNA domains lag far behind. Previous efforts, such as RiNALMo and RNA-FM, have mainly focused on specific tasks like promoter or enhancer prediction and structure and function analysis. ChatNT is among the few models striving to bridge the gap between RNA comprehension and natural language; however, its emphasis is on performing biological tasks as a conversational agent rather than providing deep RNA understanding and comprehensive dialogue.
As a result, there is a notable gap: no existing RNA chat model offers in-depth, literature-grounded RNA knowledge. Applying multimodal LLMs to RNA modeling, however, presents unique challenges, especially in integrating diverse modalities such as textual descriptions, RNA sequences, and structural data.
To overcome these challenges, we propose a two-stage approach for RNA-GPT. First, we use the RNA-FM sequence encoder to embed RNA sequences and align these sequence representations with natural language through a large, automatically curated QA dataset built from RNAcentral. Second, to ensure the model generates concise and accurate responses, we break RNA-QA's abstract summaries down into individual QA pairs for instruction tuning, enhancing the model's ability to deliver clear and relevant answers. We use Meta AI's flagship Llama-3 8B Instruct as our backbone LLM to provide strong general language understanding.
More specifically, our contributions are as follows: (1) RNA-GPT, a multi-modal RNA chat model that aligns RNA-FM sequence representations with a Llama-3 backbone to answer open-ended RNA questions; (2) RNA-QA, an automatically curated dataset of 407,616 RNA sequences paired with literature-derived annotations and instruction-tuning QA samples; and (3) experiments and ablations showing that our two-stage training pipeline substantially improves response quality over sequence-as-text and alignment-only baselines.
RNA-GPT uses the pre-trained RNA-FM sequence encoder to embed RNA sequences, which are then passed through a linear projection layer. This layer learns to map the RNA embeddings to a shared representation space with natural language, enabling alignment with a backbone LLM, for which we chose Meta’s Llama-3 8B model. The training process is divided into two stages:
Figure 1: RNA-GPT Modality Fusion & Alignment Stage: we freeze the sequence encoder block and train the linear projection layer to align RNA sequence representations with text. In the alignment stage, the training input is only the projected RNA representation; no text prompts are incorporated at this stage.
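The trainable component in this stage is the linear projection. Below is a minimal PyTorch sketch of how it might look; the 640-dimensional RNA-FM embedding width and Llama-3 8B's 4096-dimensional hidden size are assumptions taken from the public model releases, and the class and argument names are illustrative rather than the exact implementation.

```python
import torch
import torch.nn as nn

class RnaProjection(nn.Module):
    """Maps frozen RNA-FM sequence embeddings into the LLM token-embedding space.

    Dimensions are assumptions: RNA-FM hidden size ~640, Llama-3 8B hidden size 4096.
    """
    def __init__(self, rna_dim: int = 640, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Linear(rna_dim, llm_dim)

    def forward(self, rna_embeddings: torch.Tensor) -> torch.Tensor:
        # rna_embeddings: (batch, rna_len, rna_dim) from the frozen RNA-FM encoder
        return self.proj(rna_embeddings)  # (batch, rna_len, llm_dim)
```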
Modality Alignment Stage (Stage 1): RNA sequences, given as strings, are first fed into the pre-trained RNA-FM sequence encoder, which features 12 transformer layers trained on 23 million RNA sequences from the RNAcentral database via self-supervised learning. We utilize a specialized token, <RNAHere>, as the placeholder where the projected RNA representation is injected for RNA-text modality alignment.
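As a rough illustration of how the alignment-stage input could be assembled, the sketch below splices the projected RNA embedding into the prompt at the <RNAHere> placeholder; in Stage 1 the template contains only the placeholder (no question text), while Stage 2 appends a question after it. The helper and variable names are hypothetical, and the code assumes a Hugging Face-style tokenizer and model.

```python
import torch

def build_alignment_inputs(prompt_text, rna_embed, tokenizer, llm):
    """Splice the projected RNA embedding into the prompt at the <RNAHere> marker.

    prompt_text: "<RNAHere>" in Stage 1 (no text prompt), or "<RNAHere> {question}"
        in Stage 2.
    rna_embed: (rna_len, llm_dim) projected embedding for one RNA sequence.
    """
    before, after = prompt_text.split("<RNAHere>")
    embed_layer = llm.get_input_embeddings()

    def embed(text):
        ids = tokenizer(text, return_tensors="pt", add_special_tokens=False).input_ids
        return embed_layer(ids)[0]  # (num_tokens, llm_dim)

    # Concatenate the text embeddings around the projected RNA representation.
    inputs_embeds = torch.cat([embed(before), rna_embed, embed(after)], dim=0)
    return inputs_embeds.unsqueeze(0)  # (1, total_len, llm_dim)
```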
Instruction Tuning Stage (Stage 2): In stage 2, we instruction-tune the model using our curated RNA-QA dataset. We break down the full annotations into targeted QA samples with concise answers to specific questions as prediction targets. This allows the chat model to provide more relevant and accurate responses.
Figure 2: RNA-GPT Instruction Tuning Stage: we use the RNA representation from the alignment stage and combine it with question prompts for instruction tuning. The model generates answers that are concise and relevant to the questions.
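The sketch below, reusing build_alignment_inputs from the alignment sketch above, shows one way a Stage-2 training example could be packed: the RNA representation plus the question form the prompt, and the loss is computed only on the answer tokens. The prompt template and label-masking scheme are assumptions in the style of common instruction-tuning code, not the paper's exact recipe.

```python
import torch

def build_instruction_sample(question, answer, rna_embed, tokenizer, llm):
    """Pack one instruction-tuning example: RNA representation + question as the
    prompt, with the concise answer as the prediction target."""
    prompt = f"<RNAHere> {question}"  # illustrative template
    inputs_embeds = build_alignment_inputs(prompt, rna_embed, tokenizer, llm)

    answer_ids = tokenizer(answer, return_tensors="pt", add_special_tokens=False).input_ids
    answer_embeds = llm.get_input_embeddings()(answer_ids)

    # Mask prompt positions with -100 so the loss is computed only on the answer.
    labels = torch.cat(
        [torch.full((1, inputs_embeds.shape[1]), -100), answer_ids], dim=1
    )
    full_embeds = torch.cat([inputs_embeds, answer_embeds], dim=1)
    return full_embeds, labels
```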
To achieve modality alignment, we constructed a large-scale dataset from the RNAcentral database, comprising 407,616 RNA sequences paired with abstract descriptions.
Divide and Conquer RNA Literature Summarization: We begin by filtering RNA sequences from RNAcentral, focusing on those indexed by LitScan, which yields around 420,000 RNAs with associated research papers; of these, 407,616 RNAs remain after further filtering. For each of these RNAs, we scrape and extract the abstracts of all relevant literature, then apply LDA topic modeling to group the papers by topic and summarize each group individually. This ensures that each summarization focuses on a narrower, cohesive subject area, minimizing information loss.
Figure 3: RNA-QA uses an automated pipeline to scrape and summarize existing RNA literature. We apply latent Dirichlet allocation (LDA) to group the vast literature on each RNA, and then we summarize each group individually using GPT-4o-mini. These summaries are then combined and refined to produce the final RNA annotation.
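The following sketch outlines the grouping-and-summarization idea under stated assumptions: scikit-learn's LDA stands in for the topic model, a generic summarize() callable stands in for the GPT-4o-mini call, and the topic count and vectorizer settings are illustrative.

```python
from collections import defaultdict
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def group_abstracts_by_topic(abstracts, n_topics=5):
    """Cluster the abstracts citing one RNA into cohesive topic groups via LDA."""
    vectorizer = CountVectorizer(stop_words="english", max_features=5000)
    doc_term = vectorizer.fit_transform(abstracts)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    doc_topics = lda.fit_transform(doc_term)  # (n_docs, n_topics)

    groups = defaultdict(list)
    for abstract, topic_dist in zip(abstracts, doc_topics):
        groups[topic_dist.argmax()].append(abstract)  # assign by dominant topic
    return list(groups.values())

def annotate_rna(abstracts, summarize):
    """Divide and conquer: summarize each topic group separately, then merge the
    partial summaries into one annotation (summarize() is a placeholder for a
    GPT-4o-mini call)."""
    partial = [summarize("\n\n".join(group)) for group in group_abstracts_by_topic(abstracts)]
    return summarize("\n\n".join(partial))
```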
Data Augmentation: RNA-GPT decomposes the rich RNA annotations of RNA-QA into more specific QA pairs for instruction tuning using GPT-4o-mini, so that user instructions can be answered concisely.
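A hedged sketch of this decomposition step is shown below, using the OpenAI chat-completions client; the prompt wording and the JSON output schema are illustrative assumptions rather than the exact prompt used to build RNA-QA.

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def decompose_annotation(annotation: str):
    """Ask GPT-4o-mini to split a full RNA annotation into targeted QA pairs.
    Prompt wording and output schema are illustrative, not the paper's exact prompt."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": (
                "Decompose the following RNA annotation into specific question-answer "
                "pairs with concise answers. Return a JSON list of objects with "
                "'question' and 'answer' fields.\n\n" + annotation
            ),
        }],
    )
    # In practice the reply should be validated; a malformed response will raise here.
    return json.loads(response.choices[0].message.content)
```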
We trained RNA-GPT with the Llama-3 8B backbone on a smaller subset of 5,000 RNAs and 121,000 QA samples for our initial model. We are in the process of training a larger RNA-GPT that uses all 407,616 RNAs of the RNA-QA dataset with millions of QA samples.
| Metric | RNA Sequence | | | Modality Fusion | | | RNA-GPT | | |
|---|---|---|---|---|---|---|---|---|---|
| | SBERT | SPub | SGPT | SBERT | SPub | SGPT | SBERT | SPub | SGPT |
| Precision | 0.7372 | 0.5528 | 0.5219 | 0.6929 | 0.6507 | 0.6655 | 0.8602 | 0.7384 | 0.7848 |
| Recall | 0.7496 | 0.5270 | 0.5474 | 0.8028 | 0.6082 | 0.6603 | 0.8404 | 0.7208 | 0.7561 |
| F1 Score | 0.7424 | 0.5387 | 0.5339 | 0.7403 | 0.6283 | 0.6627 | 0.8494 | 0.7293 | 0.7700 |
Table 1: RNA-QA (AIS): Comparison of RNA Sequence (left), Modality Fusion (middle), and RNA-GPT (right). SBERT, SPub, and SGPT denote semantic scores computed with BERT, PubMedBERT, and OpenAI's GPT text-embedding-3-large embeddings, respectively.
We conducted a series of experiments to assess RNA-GPT's effectiveness both quantitatively and qualitatively, along with ablation studies to gauge the importance of each module at different stages. The compared configurations are the RNA Sequence baseline (the LLM given the RNA sequence as plain text), the Modality Fusion model (after Stage 1 alignment only), and the final instruction-tuned RNA-GPT.
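For reference, the sketch below shows how such scores could be computed with off-the-shelf libraries: rouge_score for ROUGE-1/2/L and bert_score for an embedding-based precision/recall/F1. The exact embedding backbones used in the paper (BERT, PubMedBERT, and OpenAI's GPT text-embedding-3-large) and its scoring details may differ from these stand-ins.

```python
from rouge_score import rouge_scorer
from bert_score import score as bert_score

def evaluate_answer(prediction: str, reference: str):
    """Illustrative evaluation of one generated answer against its reference."""
    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
    rouge = {name: s.fmeasure for name, s in scorer.score(reference, prediction).items()}

    # BERTScore-style semantic precision/recall/F1 (default English backbone).
    precision, recall, f1 = bert_score([prediction], [reference], lang="en")
    semantic = {"precision": precision.item(), "recall": recall.item(), "f1": f1.item()}
    return rouge, semantic
```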
| Metric | RNA Sequence | | | Modality Fusion | | | RNA-GPT | | |
|---|---|---|---|---|---|---|---|---|---|
| | ROUGE-1 | ROUGE-2 | ROUGE-L | ROUGE-1 | ROUGE-2 | ROUGE-L | ROUGE-1 | ROUGE-2 | ROUGE-L |
| ROUGE | 0.2364 | 0.0935 | 0.2037 | 0.2239 | 0.1364 | 0.2091 | 0.5031 | 0.3667 | 0.4747 |
Table 2: RNA-QA (AIS): ROUGE Scores for RNA Sequence, Modality Fusion, and RNA-GPT.
| Metric | RNA Sequence | | | Modality Fusion | | | RNA-GPT | | |
|---|---|---|---|---|---|---|---|---|---|
| | SBERT | SPub | SGPT | SBERT | SPub | SGPT | SBERT | SPub | SGPT |
| Precision | 0.7612 | 0.5498 | 0.5479 | 0.6884 | 0.6201 | 0.6676 | 0.8620 | 0.7173 | 0.7568 |
| Recall | 0.7654 | 0.5512 | 0.5649 | 0.8187 | 0.5830 | 0.6602 | 0.8623 | 0.7161 | 0.7554 |
| F1 Score | 0.7625 | 0.5501 | 0.5561 | 0.7466 | 0.6005 | 0.6637 | 0.8609 | 0.7165 | 0.7560 |
Table 3: RNA-QA (D&C): Comparison of RNA Sequence (left), Modality Fusion (middle), and RNA-GPT (right). SBERT, SPub, and SGPT denote semantic scores computed with BERT, PubMedBERT, and OpenAI's GPT text-embedding-3-large embeddings, respectively.
| Metric | RNA Sequence | | | Modality Fusion | | | RNA-GPT | | |
|---|---|---|---|---|---|---|---|---|---|
| | ROUGE-1 | ROUGE-2 | ROUGE-L | ROUGE-1 | ROUGE-2 | ROUGE-L | ROUGE-1 | ROUGE-2 | ROUGE-L |
| ROUGE | 0.2472 | 0.0964 | 0.2182 | 0.0922 | 0.0393 | 0.0799 | 0.4791 | 0.2690 | 0.4405 |
Table 4: RNA-QA (D&C): ROUGE Scores for RNA Sequence, Modality Fusion, and RNA-GPT.
The results demonstrate that RNA-GPT significantly outperforms both the RNA Sequence baseline and the Modality Fusion model in precision, recall, F1 score, and ROUGE metrics, indicating the effectiveness of our two-stage training process and the utility of the RNA-QA dataset.
Figures 4 and 5 illustrate the performance improvements of RNA-GPT over the baseline models. The ROUGE score comparison shows a significant increase in ROUGE-1, ROUGE-2, and ROUGE-L scores, indicating better overlap with the reference answers. The semantic score comparison, evaluated using BERT, PubMedBERT, and GPT embeddings, demonstrates enhanced semantic similarity between the generated and reference answers.
These experiments validate the effectiveness of our approach in aligning RNA sequences with natural language representations, enabling the model to generate accurate and relevant responses to complex RNA queries.