RNA-GPT: Multimodal Generative System for RNA Sequence Understanding

Yijia Xiao1, Edward Sun1, Yiqiao Jin2, Wei Wang1
1University of California, Los Angeles, 2Georgia Institute of Technology

Abstract

RNAs are vital molecules that carry genetic information essential for life, with significant implications for drug development and biotechnology. However, RNA research is often slowed by the vast amount of literature. To address this, we introduce RNA-GPT, a multi-modal RNA chat model that simplifies RNA discovery by leveraging extensive RNA literature.

RNA-GPT combines RNA sequence encoders with linear projection layers and state-of-the-art large language models (LLMs) for precise representation alignment. This enables it to process user-uploaded RNA sequences and provide concise, accurate responses. Our scalable training pipeline, powered by RNA-QA, automatically gathers RNA annotations from RNACentral using a divide-and-conquer approach with GPT-4o and latent Dirichlet allocation (LDA) to handle large datasets and generate instruction tuning samples.

Experiments show RNA-GPT effectively handles complex RNA queries, streamlining RNA research. We also introduce RNA-QA, a dataset of 407,616 RNA sequences with literature-derived annotations for modality alignment and instruction tuning.

Introduction

Large language models (LLMs) trained on internet-scale corpora have been shown to perform extraordinarily well on a large array of tasks from Olympiad-level mathematical and scientific reasoning to planning long-term tasks for robotic systems. Recent advances in the biological and medical fields have enabled the adaptation of powerful models to accelerate research, significantly reducing reliance on traditional experiments.

Because proteins, RNAs, and DNAs can be represented as character strings, and vast amounts of sequenced data are readily available, the conditions are ideal for training language models to predict and generate protein, DNA, and RNA structures and sequences. Protein language models like ESM have successfully encoded protein sequence and structure information, inspiring works such as ProteinGPT and ProtST, which adapt protein representations into a language-based format, enabling natural language querying of protein data.

Similar to ESM-2, works like RiNALMo and RNA-FM have utilized the flexible capabilities of language models to learn and predict RNA structures and functions. Much like proteins, which can be represented as strings of characters, RNAs, with their sequences of four unique nucleotides, have also sparked interest in computational RNA and DNA research using LLMs.

While models like ProteinGPT, ProtST, ProteinChat, and ProteinCLIP have made significant progress in aligning protein sequences and structures with textual descriptions, progress in the DNA and RNA domains lags behind. Previous efforts, such as RiNALMo and RNA-FM, have mainly focused on specific tasks like promoter and enhancer prediction and structure and function analysis. ChatNT is among the few models striving to bridge the gap between RNA comprehension and natural language. However, its emphasis is on performing biological tasks as a conversational agent rather than providing deep RNA understanding and comprehensive dialogue.

As a result, there is a notable gap in RNA chat models that offer in-depth knowledge. Filling this gap with multimodal LLMs, however, presents unique challenges, especially in integrating diverse modalities such as textual descriptions, RNA sequences, and structural data.

To overcome these challenges, we propose a two-stage approach for RNA-GPT. First, we utilize the RNA-FM sequence encoder to embed RNA sequences and align these sequence representations with natural language through a large, automatically curated QA dataset built from RNA Central. Second, to ensure our model generates concise and accurate responses, we break down RNA-QA's abstract summaries into individual QA pairs for instruction tuning, enhancing the model's ability to deliver clear and relevant answers. We use Meta AI's flagship Llama-3 8B Instruct as our backbone LLM to provide solid general language understanding.

More specifically, our contributions are as follows:

  • Novel Framework: RNA-GPT is one of the first multi-modal RNA sequence chat models, enabling deep, interactive RNA-focused conversations and significantly enhancing the understanding of RNAs for biological research.
  • Large-scale Dataset and Collection Pipeline: We introduce RNA-QA, a QA dataset derived from the RNA Central Database for modality alignment and instruction tuning of RNA chat models. We also present our highly scalable collection pipeline, which automates the scraping and summarizing of relevant literature on RNA. Using a divide-and-conquer summarization strategy, we ensure that research details are preserved effectively.

Methodology

RNA-GPT uses the pre-trained RNA-FM sequence encoder to embed RNA sequences, which are then passed through a linear projection layer. This layer learns to map the RNA embeddings to a shared representation space with natural language, enabling alignment with a backbone LLM, for which we chose Meta’s Llama-3 8B model. The training process is divided into two stages:

  1. Sequence and Modality Alignment: RNA and natural language representations are aligned.
  2. Instruction Tuning: The model is fine-tuned for task-specific QA generation.
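
To make the wiring concrete, below is a minimal PyTorch-style sketch of the architecture described above. The dimensions reflect RNA-FM's 640-dimensional embeddings and Llama-3 8B's 4096-dimensional hidden states, but the class and argument names are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class RNAGPTSketch(nn.Module):
    """Frozen RNA-FM encoder + trainable linear projection feeding an LLM."""

    def __init__(self, rna_encoder, llm, rna_dim=640, llm_dim=4096):
        super().__init__()
        self.rna_encoder = rna_encoder            # pre-trained RNA-FM, kept frozen
        for p in self.rna_encoder.parameters():
            p.requires_grad = False
        self.projection = nn.Linear(rna_dim, llm_dim)  # trained during stage 1
        self.llm = llm                            # Llama-3 8B backbone

    def encode_rna(self, rna_tokens):
        # Per-nucleotide embeddings from the frozen encoder, mapped into the
        # LLM's representation space by the linear projection layer.
        with torch.no_grad():
            rna_repr = self.rna_encoder(rna_tokens)     # (batch, len, rna_dim)
        return self.projection(rna_repr)                # (batch, len, llm_dim)
```
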
Modality Alignment Stage

Figure 1: RNA-GPT Modality Fusion & Alignment Stage: we freeze the sequence encoder block and train the linear projection layer to align RNA sequence representations with text. In the alignment stage, the training input is only the projected RNA representation; no text prompts are incorporated at this stage.

Modality Alignment Stage (Stage 1): RNA sequences, given as strings, are first fed into the pre-trained sequence encoder, which features 12 transformer layers trained on 23 million RNA sequences from the RNA Central database via self-supervised learning. We utilize a specialized placeholder token, <RNAHere>, for RNA-text modality alignment.
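
One common way to realize such a placeholder, and a reasonable reading of how <RNAHere> is used here, is to split the text prompt at the placeholder and splice the projected RNA embeddings between the two halves before they reach the LLM. The sketch below assumes a Hugging Face-style tokenizer and embedding table; the function and variable names are illustrative, not the paper's code.

```python
import torch

def splice_rna_embedding(prompt, rna_embeds, tokenizer, embed_tokens):
    """Replace the <RNAHere> placeholder with projected RNA embeddings.

    prompt      : text containing exactly one "<RNAHere>" placeholder
    rna_embeds  : (1, rna_len, llm_dim) output of the projection layer
    embed_tokens: the LLM's input embedding layer
    """
    before, after = prompt.split("<RNAHere>")
    ids_before = tokenizer(before, return_tensors="pt").input_ids
    ids_after = tokenizer(after, return_tensors="pt",
                          add_special_tokens=False).input_ids
    emb_before = embed_tokens(ids_before)   # (1, n_before, llm_dim)
    emb_after = embed_tokens(ids_after)     # (1, n_after, llm_dim)
    # Text and RNA embeddings are concatenated along the sequence dimension,
    # so the LLM sees the RNA exactly where the placeholder was.
    return torch.cat([emb_before, rna_embeds, emb_after], dim=1)

# In the alignment stage the surrounding text is minimal (Figure 1 notes that
# no text prompt is used), while in the instruction-tuning stage the question
# prompt wraps the placeholder.
```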

Instruction Tuning Stage (Stage 2): In stage 2, we instruction-tune the model using our curated RNA-QA dataset. We break the full annotations down into targeted QA samples, using concise answers to specific questions as the prediction targets. This allows the chat model to provide more relevant and accurate responses.

Instruction Tuning Stage

Figure 2: RNA-GPT Instruction Tuning Stage: we use the RNA representation from the alignment stage and combine it with question prompts for instruction tuning. The model generates answers that are concise and relevant to the questions.
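
For illustration, one instruction-tuning sample in this setup might look like the following. The field names, example sequence, and prompt template are hypothetical; the questions and concise answers come from RNA-QA's decomposed annotations.

```python
# Hypothetical shape of a single instruction-tuning sample.
sample = {
    "rna_sequence": "AUGGCUACGUAGCUAGCUAGC",   # nucleotide string fed to RNA-FM
    "question": "What biological roles have been reported for this RNA?",
    "answer": "…",                             # concise, question-specific target
}

# The question prompt wraps the RNA placeholder; the projected RNA embedding
# replaces <RNAHere> before the input reaches the LLM (see the splicing sketch
# above). The exact template wording is an assumption.
prompt = "<RNAHere> " + sample["question"]
target = sample["answer"]
```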

RNA-QA Dataset

To achieve modality alignment, we constructed a large-scale dataset from the RNA Central database, comprising 407,616 RNA sequences paired with descriptions derived from the abstracts of their associated literature.

Divide and Conquer RNA Literature Summarization: We begin by filtering RNA sequences from RNA Central, focusing on those indexed with "Lit Scan," yielding around 420,000 RNAs with associated research papers. After further filtering, 407,616 RNAs remain; for each of these, we scrape and extract abstracts from all relevant literature. We apply LDA topic modeling to group papers by topic and summarize each group individually. This ensures each summarization focuses on a narrower, cohesive subject area, minimizing information loss.

RNA-QA Dataset Pipeline

Figure 3: RNA-QA uses an automated pipeline to scrape and summarize existing RNA literature. We apply latent Dirichlet allocation (LDA) to group the vast literature on each RNA, and then we summarize each group individually using GPT-4o-mini. These summaries are then combined and refined to produce the final RNA annotation.
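
A minimal sketch of this grouping step is shown below, using scikit-learn's LDA implementation. The paper's exact preprocessing, topic count, and prompts are not specified here, so those choices (and the `summarize` callable, which stands in for the GPT-4o-mini summarization calls) are assumptions.

```python
from collections import defaultdict
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def group_abstracts_by_topic(abstracts, n_topics=5):
    """Assign each abstract to its dominant LDA topic."""
    counts = CountVectorizer(stop_words="english", max_features=5000).fit_transform(abstracts)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    topic_probs = lda.fit_transform(counts)            # (n_abstracts, n_topics)
    groups = defaultdict(list)
    for abstract, probs in zip(abstracts, topic_probs):
        groups[int(probs.argmax())].append(abstract)   # dominant-topic assignment
    return groups

def divide_and_conquer_summary(abstracts, summarize, n_topics=5):
    """Summarize each topic group separately, then merge the partial summaries.

    `summarize` maps a list of texts to one summary string; in the paper this
    role is played by GPT-4o-mini.
    """
    groups = group_abstracts_by_topic(abstracts, n_topics)
    partial_summaries = [summarize(texts) for texts in groups.values()]
    return summarize(partial_summaries)
```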

Data Augmentation: We decompose the rich RNA annotations in RNA-QA into more specific QA pairs using GPT-4o-mini, producing instruction-tuning data so that user questions can be answered concisely.
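
The sketch below shows one way this decomposition could be issued to GPT-4o-mini through the OpenAI API. The prompt wording and the JSON schema are illustrative assumptions, not the paper's exact instructions.

```python
import json
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def annotation_to_qa_pairs(annotation: str) -> list[dict]:
    """Ask GPT-4o-mini to split one RNA annotation into concise QA pairs."""
    prompt = (
        "Break the following RNA annotation into question-answer pairs with "
        "concise, specific answers. Respond with JSON of the form "
        '{"qa_pairs": [{"question": "...", "answer": "..."}]}.\n\n' + annotation
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)["qa_pairs"]
```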

Experiments

We trained our initial RNA-GPT model on the flagship Llama-3 8B backbone using a smaller subset of 5,000 RNAs and 121,000 QA samples. We are in the process of training the larger RNA-GPT that utilizes all 407,616 RNAs of the RNA-QA dataset, with millions of QA samples.

Metric      RNA Sequence                Modality Fusion             RNA-GPT
            SBERT   SPub    SGPT        SBERT   SPub    SGPT        SBERT   SPub    SGPT
Precision   0.7372  0.5528  0.5219      0.6929  0.6507  0.6655      0.8602  0.7384  0.7848
Recall      0.7496  0.5270  0.5474      0.8028  0.6082  0.6603      0.8404  0.7208  0.7561
F1 Score    0.7424  0.5387  0.5339      0.7403  0.6283  0.6627      0.8494  0.7293  0.7700

Table 1: RNA-QA (AIS): Comparison of RNA Sequence (left), Modality Fusion (middle), and RNA-GPT (right). Embedding base models are BERT, PubMedBERT, and OpenAI's GPT text-embedding-3-large.

We conducted a series of experiments to assess RNA-GPT's effectiveness both quantitatively and qualitatively, along with ablation studies to ascertain the importance of the various modules at different stages. The configurations compared are the original model (the LLM with the RNA sequence given as plain text), the modality-aligned model, and the final instruction-tuned model, corresponding to the RNA Sequence, Modality Fusion, and RNA-GPT columns in Tables 1-4.

Metric      RNA Sequence                      Modality Fusion                   RNA-GPT
            ROUGE-1  ROUGE-2  ROUGE-L         ROUGE-1  ROUGE-2  ROUGE-L         ROUGE-1  ROUGE-2  ROUGE-L
ROUGE       0.2364   0.0935   0.2037          0.2239   0.1364   0.2091          0.5031   0.3667   0.4747

Table 2: RNA-QA (AIS): ROUGE Scores for RNA Sequence, Modality Fusion, and RNA-GPT.

Figure 4: ROUGE Score Comparison
Figure 5: Semantic Score Comparison

Metric      RNA Sequence                Modality Fusion             RNA-GPT
            SBERT   SPub    SGPT        SBERT   SPub    SGPT        SBERT   SPub    SGPT
Precision   0.7612  0.5498  0.5479      0.6884  0.6201  0.6676      0.8620  0.7173  0.7568
Recall      0.7654  0.5512  0.5649      0.8187  0.5830  0.6602      0.8623  0.7161  0.7554
F1 Score    0.7625  0.5501  0.5561      0.7466  0.6005  0.6637      0.8609  0.7165  0.7560

Table 3: RNA-QA (D&C): Comparison of RNA Sequence (left), Modality Fusion (middle), and RNA-GPT (right). Embedding base models are BERT, PubMedBERT, and OpenAI's GPT text-embedding-3-large.

Metric      RNA Sequence                      Modality Fusion                   RNA-GPT
            ROUGE-1  ROUGE-2  ROUGE-L         ROUGE-1  ROUGE-2  ROUGE-L         ROUGE-1  ROUGE-2  ROUGE-L
ROUGE       0.2472   0.0964   0.2182          0.0922   0.0393   0.0799          0.4791   0.2690   0.4405

Table 4: RNA-QA (D&C): ROUGE Scores for RNA Sequence, Modality Fusion, and RNA-GPT.

The results demonstrate that RNA-GPT significantly outperforms both the original model and the modality fusion model in terms of precision, recall, F1 score, and ROUGE metrics. This indicates the effectiveness of our two-stage training process and the utility of the RNA-QA dataset.

Figures 4 and 5 illustrate the performance improvements of RNA-GPT over the baseline models. The ROUGE score comparison shows a significant increase in ROUGE-1, ROUGE-2, and ROUGE-L scores, indicating better overlap with the reference answers. The semantic score comparison, evaluated using BERT, PubMedBERT, and GPT embeddings, demonstrates enhanced semantic similarity between the generated and reference answers.
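
For readers who want to reproduce this style of evaluation, the sketch below computes ROUGE with the `rouge-score` package and embedding-based precision, recall, and F1 with `bert-score`. This mirrors the metric families reported in Tables 1-4 but is not necessarily the paper's exact scoring code; in particular, the embedding backbones used there (BERT, PubMedBERT, and GPT text-embedding-3-large) differ from `bert-score`'s default model.

```python
from rouge_score import rouge_scorer
from bert_score import score as bert_score

def evaluate_answer(candidate: str, reference: str) -> dict:
    """Lexical overlap (ROUGE) and embedding-based similarity for one answer."""
    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
    rouge = scorer.score(reference, candidate)          # target first, then prediction
    precision, recall, f1 = bert_score([candidate], [reference], lang="en")
    return {
        "rouge1": rouge["rouge1"].fmeasure,
        "rouge2": rouge["rouge2"].fmeasure,
        "rougeL": rouge["rougeL"].fmeasure,
        "precision": precision.item(),
        "recall": recall.item(),
        "f1": f1.item(),
    }
```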

These experiments validate the effectiveness of our approach in aligning RNA sequences with natural language representations, enabling the model to generate accurate and relevant responses to complex RNA queries.