Two-layer retrieval augmented generation framework for low-resource medical question-answering: proof of concept using Reddit data (2024)

Sudeshna Das, Department of Biomedical Informatics, School of Medicine, Emory University, Atlanta, GA, USA
sudeshna.das@emory.edu
Yao Ge, Department of Biomedical Informatics, School of Medicine, Emory University, Atlanta, GA, USA
yao.ge@emory.edu
Yuting Guo, Department of Biomedical Informatics, School of Medicine, Emory University, Atlanta, GA, USA
yuting.guo@emory.edu
Swati Rajwal, Department of Computer Science and Informatics, Emory University, Atlanta, GA, USA
swati.rajwal@emory.edu
JaMor Hairston, Department of Biomedical Informatics, School of Medicine, Emory University, Atlanta, GA, USA
jamor.hairston@emory.edu
Jeanne Powell, Department of Biomedical Informatics, School of Medicine, Emory University, Atlanta, GA, USA
jeanne.marie.powell@emory.edu
Drew Walker, Department of Biomedical Informatics, School of Medicine, Emory University, Atlanta, GA, USA
andrew.walker@emory.edu
Snigdha Peddireddy, Department of Behavioral, Social, & Health Education Sciences, Rollins School of Public Health, Emory University, Atlanta, GA, USA
snigdha.peddireddy@emory.edu
Sahithi Lakamana, Department of Biomedical Informatics, School of Medicine, Emory University, Atlanta, GA, USA
sahithi.krishnaveni.lakamana@emory.edu
Selen Bozkurt, Department of Biomedical Informatics, School of Medicine, Emory University, Atlanta, GA, USA
selen.bozkurt@emory.edu
Matthew Reyna, Department of Biomedical Informatics, School of Medicine, Emory University, Atlanta, GA, USA
matthew.a.reyna@emory.edu
Reza Sameni, Department of Biomedical Informatics, School of Medicine, Emory University, Atlanta, GA, USA
reza.sameni@emory.edu
Yunyu Xiao, Department of Population Health Sciences, Weill Cornell Medicine, New York, NY, USA
yux4008@med.cornell.edu
Sangmi Kim, Nell Hodgson Woodruff School of Nursing, Emory University, Atlanta, GA, USA
sangmi.kim@emory.edu
Rasheeta Chandler, Nell Hodgson Woodruff School of Nursing, Emory University, Atlanta, GA, USA
r.d.chandler@emory.edu
Natalie Hernandez, Center for Maternal Health Equity, Morehouse School of Medicine, Atlanta, GA, USA
nhernandez@msm.edu
Danielle Mowery, Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, PA, USA
dlmowery@pennmedicine.upenn.edu
Rachel Wightman, Department of Emergency Medicine, Warren Alpert Medical School of Brown University, Providence, RI, USA
rachel_wightman@brown.edu
Jennifer Love, Department of Emergency Medicine, Icahn School of Medicine at Mount Sinai, New York, NY, USA
jennifer.love@mountsinai.org
Anthony Spadaro, Department of Emergency Medicine, Rutgers New Jersey Medical School, Newark, NJ, USA
avs156@njms.rutgers.edu
Jeanmarie Perrone, Department of Emergency Medicine, Perelman School of Medicine at the University of Pennsylvania, Philadelphia, PA, USA
jeanmarie.perrone@pennmedicine.upenn.edu
Abeed Sarker, Department of Biomedical Informatics, School of Medicine, Emory University, Atlanta, GA, USA; Department of Biomedical Engineering, Georgia Institute of Technology and Emory University, Atlanta, GA, USA
abeed.sarker@emory.edu

Abstract

Retrieval augmented generation (RAG) makes it possible to constrain generative model outputs, and to mitigate the possibility of hallucination, by providing relevant in-context text. The number of tokens a generative large language model (LLM) can incorporate as context is finite, which limits the volume of knowledge from which an answer can be generated. We propose a two-layer RAG framework for query-focused answer generation and evaluate a proof of concept for this framework in the context of query-focused summary generation from social media forums, focusing on emerging drug-related information. The evaluations demonstrate the effectiveness of the two-layer framework in resource-constrained settings, enabling researchers to obtain near real-time data from users.

* Equal contribution.

The emergence of generative large language models (LLMs) has opened up unprecedented opportunities for solving traditionally complex biomedical natural language processing (NLP) problems, such as medical question-answering (MQA). However, many practical challenges hinder their deployment and use in real-world settings, such as their high computational resource requirements. Another major issue with text produced by LLMs, particularly pertinent to MQA, is "hallucination". In the context of LLMs, hallucination refers to a model generating text that sounds plausible but is nonsensical or incorrect[7]. Chain-of-thought prompting[4], self-reflection[7], and retrieval-augmented generation have been forerunners in addressing this issue. In addition to mitigating hallucination, retrieval-augmented generation (RAG) aids in constraining generated texts and improves in-context learning[5]. RAG frameworks built on LLMs have recently been adopted in the biomedical domain owing to the need for timely, accurate, and transparent responses[15]. As generative AI systems, particularly LLMs, become increasingly integrated into clinical practice[9], it is important to ensure that such systems are equitable (i.e., can operate in low-resource settings)[6], while also generating accurate and coherent texts.

We present a proof of concept for a two-layer retrieval-augmented generative framework for MQA, modeled as query-focused text summarization, that ingests user-generated medical information from the social media website Reddit to answer medical questions. We specifically focus on using smaller, quantized open-source LLMs that can run on personal computers without specialized computational hardware. This architectural choice allows our framework to be used in low-resource settings, ensuring equitable access to timely medical information. We evaluate our model on questions focusing on emerging information about medications and substances prone to nonmedical use. Our proposed framework modularizes the generation of data-driven summaries that answer clinicians’ queries into two steps: (i) retrieving relevant Reddit posts for a given query, segmenting each post to fit the context window length, and summarizing each segment individually, and (ii) ingesting the individual summaries to generate a final summary that answers the original query. Given the modular nature of our framework, any retrieval engine or LLM can be used by the system. The framework also allows distinct LLMs, each specializing in summarization at a different context length, to be used in the two layers. Our framework was designed and developed with the goal of generating answers to clinical questions that rely only on the text provided to the LLM. Thus, any prior knowledge encoded within the LLM about the chosen topic should, ideally, not influence the generated text. We note here that our goal is not to vet the provided text (Reddit posts) for inaccurate information. It is, in fact, desirable for the system to summarize any misinformation present in the posts for faithfulness and transparency. Figure 1 illustrates the architecture for the two-layer answer generation process.

Figure 1: Architecture of the two-layer answer generation process.

Since our focus is on testing system performance in low-resource settings, we used a relatively small model as our LLM along with a fast retrieval engine that creates relatively small inverted indexes. This lightweight prototype (i) allows us to test the minimum threshold of performance of this architecture, and (ii) ensures that RAG systems built using this architecture can be deployed in low-resource settings. To demonstrate the applicability of the framework using LLMs of varying model sizes, we also evaluate the proposed approach using GPT-4[1], one of the largest LLMs currently available.

We study the topic of emerging drugs using data from the social network Reddit. Reddit has over 52 million daily active users and is commonly used to study emerging themes in the field of medicine[11]. Reddit features a large volume of discussions about substances and their nonmedical uses, and, in recent years, data from Reddit has been leveraged particularly to study emerging information about novel psychoactive substances since such information is not typically available elsewhere. We chose two substances that have gained attention recently: xylazine (because of its increasing impact and association with the US opioid crisis) and ketamine (because of its recent popularity as a treatment for depression). We extracted all Reddit posts mentioning xylazine ($N = 177{,}684$) and ketamine ($N = 7{,}699$). Based on clinician-driven interests, we formulated a set of 20 queries associated with these two substances.

We conducted extensive expert evaluation of the generated answers in terms of coverage, coherence, relevance, length, and hallucination. To ensure a fair evaluation, annotators were not told which LLM was used to generate each summary. In terms of coverage, median scores for GPT-4 and NousHermes2 7B DPO were 4 and 5 respectively, on a 5-point Likert scale; there was no significant difference between GPT-4 and NousHermes2 7B DPO (Mann–Whitney $U = 98.5$, $n_1 = 9$, $n_2 = 34$, $p = 0.067$, two-tailed). Median scores for coherence on a 5-point Likert scale for GPT-4 and NousHermes2 7B DPO were 4 and 5 respectively; the distributions in the two groups differed significantly (Mann–Whitney $U = 69.5$, $n_1 = 9$, $n_2 = 34$, $p = 0.002$, two-tailed).

For the evaluation criterion of relevance, the median scores for GPT-4 and NousHermes2 7B DPO were 3 for both groups; there was no significant difference between the two groups (Mann–Whitney $U = 157.5$, $n_1 = 9$, $n_2 = 34$, $p = 0.647$, two-tailed). For length, median scores on a 3-point Likert scale for GPT-4 and NousHermes2 7B DPO were 3 for both groups; there was no significant difference between the two (Mann–Whitney $U = 148$, $n_1 = 9$, $n_2 = 34$, $p = 0.875$, two-tailed).

On the binary scale for hallucination, the median scores for GPT-4 and NousHermes2 7B DPO were 0 for both groups; there was no significant difference between GPT-4 and NousHermes2 7B DPO (Mann–Whitney $U = 165.5$, $n_1 = 9$, $n_2 = 34$, $p = 0.326$, two-tailed).

The median Coleman-Liau Index values for GPT-4 and NousHermes2 7B DPO were 12.82 and 12.125 respectively; there was no significant difference between GPT-4 and NousHermes2 7B DPO (Mann–Whitney $U = 33.5$, $n_1 = 4$, $n_2 = 16$, $p = 0.865$, two-tailed).

Of the queries given to the system, the median token counts for queries posed to GPT-4 and NousHermes2 7B DPO were 5 and 7 respectively; there was no significant difference between the two groups (Mann–Whitney $U = 16.0$, $n_1 = 4$, $n_2 = 16$, $p = 0.083$, two-tailed).

The median lengths of responses generated by GPT-4 and NousHermes2 7B DPO were 765 and 441 for the combined individual summaries, and 107 and 61 for the final summaries respectively. In both cases, there was no significant difference between the two groups (Mann–Whitney $U = 53.0$, $n_1 = 4$, $n_2 = 16$, $p = 0.160$, two-tailed, and Mann–Whitney $U = 57.0$, $n_1 = 4$, $n_2 = 16$, $p = 0.081$, two-tailed).

Thus, our proposed system is able to answer queries with high relevance and faithfulness to the retrieved documents. In particular, the system is able to answer questions such as "What are k cramps like?", which are difficult to answer by manual perusal of the large volume of ketamine-related posts on Reddit.

With the rapidly changing trends in novel substances and non-prescription use of drugs, our proposed framework can aid clinicians in obtaining insights into emerging side-effects of drugs, potential concurrent use of multiple substances, as well as the general perception of people toward specific drugs. Since our framework is able to synthesize responses almost entirely based on the text given to it, an interesting application of our system can be to detect misinformation on specific substances. Furthermore, since it is trivial to specify date ranges in the IR module of the framework, it can also be used to answer queries focused within specific dates.


METHODS

System Architecture

As depicted in Figure 1, the user first submits a query regarding a specific topic. The query is parsed by the information retrieval (IR) engine, which returns a ranked list of documents. From this ranked list, the top n documents are chosen as sources for answer generation. In the first layer of the two-layer LLM architecture, the LLM is provided with (i) the query, (ii) text from the retrieved document(s), and (iii) a prompt that embeds the text and instructs the LLM to summarize the given texts. Since the prompt context window is finite, it is typically impossible to feed the LLM all the retrieved text needed to generate the answer. Even single documents can be too long. Thus, the framework allows segment lengths to be specified for the retrieved text in each iteration, ensuring that the framework is applicable to relatively small LLMs with shorter context lengths. The first LLM layer generates short, query-focused summaries. Figure 1 presents an example of a prompt embedding a retrieved text segment within it, and the resulting summary. If the retrieved text segment does not contain an answer to the question, the LLM states so. Sample first-layer summaries are provided in the supplementary material.

The second layer of the LLM takes as input the original query and the individual short summaries, embedded within a second prompt that is optimized for synthesizing the individual summaries while ignoring summaries in which the LLM clearly states that the text segment did not contain the answer. Figure 1 presents an example of the final, synthesized summary for the original query, based on the texts provided.
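The control flow of the two layers can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `segment_text`, `two_layer_answer`, the word-based segment budget `max_words`, and the prompt wording are hypothetical, and `llm` stands for any callable that maps a prompt string to generated text.

```python
from typing import Callable, List, Tuple


def segment_text(text: str, max_words: int) -> List[str]:
    """Split one post into consecutive segments of at most max_words words.

    Counting words (rather than model tokens) is an assumption made to keep
    the sketch independent of any particular tokenizer.
    """
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def two_layer_answer(
    query: str,
    posts: List[str],
    llm: Callable[[str], str],
    max_words: int = 200,
) -> Tuple[List[str], str]:
    """Return (individual summaries, final summary) for a query."""
    # Layer 1: one query-focused summary per text segment.
    individual: List[str] = []
    for post in posts:
        for segment in segment_text(post, max_words):
            prompt = (
                f"Summarize the following text: {segment}\n"
                f"in response to the question: {query}"
            )
            individual.append(llm(prompt))

    # Layer 2: synthesize all individual summaries into one answer.
    # Summaries stating that a segment held no answer are passed through
    # unchanged; ignoring them is left to the second prompt and the model.
    bullet_list = "\n".join(f"- {summary}" for summary in individual)
    final_prompt = (
        f"Summarize the individual summaries based on the question: {query}\n"
        f"{bullet_list}"
    )
    return individual, llm(final_prompt)
```

Because `llm` is passed in as a plain callable, a different model could be supplied for each layer by splitting the argument in two, which mirrors the modularity described above.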

System Setting and Data

We used the 8-bit quantized model Nous-Hermes 2 7B DPO as our LLM and the Python "Whoosh" package[2] as the retrieval engine for our primary setup. The Nous-Hermes 2 7B DPO model is instruction-tuned on 1,000,000 high-quality instructions/chats[13]. It is an open-source model and can be run locally. We evaluated our proof of concept in a setting where large amounts of data are available for a given topic, but gathering insights and answering questions related to the topic based on the data requires substantial manual work. We collected all available data from Reddit via the PushShift Application Programming Interface (API) until December 31, 2023 (close to 2.5 billion posts), and extracted posts mentioning xylazine and nitazene. These posts represent the documents to be retrieved by our retrieval engine.

To test the performance of the proposed framework with larger models, we also performed an evaluation using GPT-4 (speculated to have over 1T parameters) as the LLM in both layers. This forms our secondary evaluation setup.

Retrieval Augmented Generation

We employed a simple keyword-based retrieval in which the original question was tokenized. Since the retrieval aspect of this architecture is not our primary focus, we used the default search settings provided by our IR package, which uses Okapi BM25F as the ranking function. From the ranked retrieved documents, the top 50 were chosen for generating the first-layer individual summaries. We found this number to be sufficient, although it may be adjusted as needed without requiring any changes to the architecture. Note that the total number of text segments is typically higher because many posts are long and do not fit entirely within the context window of our model, particularly after being embedded within the prompt. Our work is similar to [12]. However, unlike that approach, in which segments of text are generated chronologically, in our work segmentation is done at the post level, without accounting for chronology.
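To make the ranking step concrete, the sketch below scores documents with plain Okapi BM25 in pure Python. It is an illustration only: it is a single-field simplification rather than the BM25F implementation in the IR package used in the study, the function names `tokenize` and `bm25_rank` are hypothetical, and the parameter values `k1 = 1.2` and `b = 0.75` are commonly used choices assumed here rather than taken from the source.

```python
import math
import re
from collections import Counter
from typing import List, Tuple


def tokenize(text: str) -> List[str]:
    """Lowercase and split on non-alphanumeric characters (a simplification)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_rank(
    query: str,
    docs: List[str],
    top_n: int = 50,
    k1: float = 1.2,
    b: float = 0.75,
) -> List[Tuple[int, float]]:
    """Return up to top_n (document index, score) pairs, best first.

    Documents that share no term with the query are omitted.
    """
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    if n_docs == 0:
        return []
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs

    # Document frequency: number of documents containing each term.
    doc_freq: Counter = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    query_terms = set(tokenize(query))
    scored = []
    for idx, tokens in enumerate(tokenized):
        term_freq = Counter(tokens)
        score = 0.0
        for term in query_terms:
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue
            # Non-negative IDF variant of BM25.
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            length_norm = 1 - b + b * len(tokens) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        if score > 0:
            scored.append((idx, score))

    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]
```

The default `top_n = 50` matches the number of retrieved documents reported above; as noted there, the cutoff can be changed without touching the rest of the pipeline.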

  • Prompt 1 (Layer 1): "Summarize the following text """""" in response to the question {QUERY}"

  • Prompt 2 (Layer 2): "Summarize the individual summaries based on the question {QUERY}"

Prompt 1, used in the first layer of the architecture, is passed to the LLM along with the top 50 retrieved documents. Prompt 2, used in the second layer of the architecture, is given in conjunction with the individual summaries generated by the first layer of the system.

Evaluation

Our evaluation focused on the summary generation quality, rather than the retrieval performance, of our proposed architecture. Commonly used automatic summary evaluation methods, such as ROUGE[8] and BLEU[10], primarily focus on text overlap between generated summaries and gold-standard summaries. In the absence of such gold-standard summaries, we employed a manual evaluation conducted by subject matter experts. Our emphasis on manual evaluation further allows us to qualitatively evaluate the important nuances of generative summaries, which is not possible with ROUGE or BLEU. We used a Likert scale-based evaluation involving two questions with 5-point scales, two questions with 3-point scales, and one question with a binary answer. Table 1 lists the criteria, corresponding questions asked to the annotators, and the evaluation scales used. In addition to the manual evaluation, we also assess the readability of the final generated summaries using the Coleman-Liau Readability Index[3], which approximates the US grade level required to comprehend the text.
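For reference, the Coleman-Liau Index is commonly stated as CLI = 0.0588 L − 0.296 S − 15.8, where L is the average number of letters per 100 words and S is the average number of sentences per 100 words. The sketch below computes it in Python. The word and sentence counting rules are simple heuristics assumed for illustration, and the function name `coleman_liau_index` is hypothetical; the tool actually used to compute the scores in this study is not specified, so values may differ slightly.

```python
import re


def coleman_liau_index(text: str) -> float:
    """Approximate US grade level of a text using the Coleman-Liau formula."""
    # Heuristic: a word is any whitespace-separated token with at least one letter.
    words = [tok for tok in text.split() if any(ch.isalpha() for ch in tok)]
    if not words:
        raise ValueError("text contains no words")

    letters = sum(ch.isalpha() for ch in text)

    # Heuristic: a sentence ends at ., ! or ? followed by whitespace or end of text.
    sentences = max(1, len(re.findall(r"[.!?]+(?:\s|$)", text)))

    letters_per_100 = letters / len(words) * 100
    sentences_per_100 = sentences / len(words) * 100
    return 0.0588 * letters_per_100 - 0.296 * sentences_per_100 - 15.8
```

Longer words raise the score and shorter sentences lower it, so a value near 12, as reported for both models, corresponds roughly to text written at a US twelfth-grade level.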

Statistical Analysis

We performed non-parametric tests (Mann–Whitney U tests) with the null hypothesis $H_0$: the two populations are equal, to determine whether the scores assigned to answers generated by GPT-4 and Nous-Hermes 2 7B DPO differ significantly. The alternative hypothesis $H_1$: the populations are not equal, was accepted when $H_0$ was rejected with $p < 0.05$. All tests were performed using the SciPy package[14].
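To illustrate what the test computes, the sketch below derives the U statistic from midranks and a two-tailed p-value from a tie-corrected normal approximation, using only the Python standard library. It is not the SciPy routine used in the study: the function name `mann_whitney_u` is hypothetical, no continuity correction is applied, and for small samples such as those reported here the approximation is rough, so p-values may differ from SciPy's.

```python
import math
from typing import Sequence, Tuple


def mann_whitney_u(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Return (U statistic of sample x, two-tailed p-value)."""
    n1, n2 = len(x), len(y)
    if n1 == 0 or n2 == 0:
        raise ValueError("both samples must be non-empty")

    # Pool the samples, remembering which group each value came from.
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y], key=lambda p: p[0])

    # Assign midranks to tied values and accumulate the tie-correction term.
    ranks = [0.0] * len(pooled)
    tie_term = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        midrank = (i + 1 + j) / 2  # mean of the 1-based ranks i+1 .. j
        for k in range(i, j):
            ranks[k] = midrank
        ties = j - i
        tie_term += ties ** 3 - ties
        i = j

    rank_sum_x = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    u_x = rank_sum_x - n1 * (n1 + 1) / 2

    n = n1 + n2
    mean_u = n1 * n2 / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u_x, 1.0  # every value is tied; no evidence of a difference

    z = (u_x - mean_u) / math.sqrt(var_u)
    p_two_tailed = math.erfc(abs(z) / math.sqrt(2))
    return u_x, p_two_tailed
```

With Likert scores, ties are frequent, which is why the variance includes the tie-correction term. Under the decision rule above, $H_0$ would be rejected when the returned p-value falls below 0.05.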

Table 1: Evaluation criteria, questions asked to the annotators, and evaluation scales.

| Criteria | Question | Evaluation Scale |
|---|---|---|
| Coverage | Does the final summary accurately represent the information present in the original text? | 5: Yes; the final summary covers all the important information present in the original text. |
| | | 4: Mostly; the final summary covers most, but not all of the important information. |
| | | 3: Somewhat; the final summary covers some of the important information, but also misses some of them. |
| | | 2: Not really; the final summary misses most of the important information. |
| | | 1: No; the final summary does not cover any of the important information present in the original text. |
| Coherence | Is the final summary coherent? | 5: Yes; the final summary is easy to read and understand. |
| | | 4: Mostly; the final summary is readable, but not straightforward to understand. |
| | | 3: Somewhat; the final summary is readable, but confusing. |
| | | 2: Not really; the final summary has some grammatical errors or non sequiturs. |
| | | 1: No; the final summary is unintelligible or incomprehensible. |
| Relevance | Does the final summary answer the original question? | 3: Yes; the summary answers the original question. |
| | | 2: Partially; the summary answers the original question, but not fully. |
| | | 1: No; the summary does not answer the original question. |
| Length | Is the length of the final summary appropriate? | 3: Yes; the summary is appropriate in length. |
| | | 2: Somewhat; the summary could be shorter/longer. |
| | | 1: No; the summary is long-winded/too short. |
| Hallucination | Does the summary contain information not present in the original text? | 0: No; the summary does not contain information not present in the original text. |
| | | 1: Yes; the summary contains information not present in the original text. |

Data Availability

All data used in this study were publicly available from Reddit at the time of data collection. The first- and second-level summaries are available as supplementary material. The original social media posts used in the study are not being made public in order to preserve the anonymity of the authors. The anonymized posts are available from the corresponding author upon reasonable request and the completion of a data use agreement.

Competing Interests

All authors declare no financial or non-financial competing interests.

References

  • [1] J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
  • [2] M. Chaput. Whoosh. https://whoosh.readthedocs.io/en/latest, 2007.
  • [3] M. Coleman and T. L. Liau. A computer readability formula designed for machine scoring. Journal of Applied Psychology, 60(2):283, 1975.
  • [4] S. Dhuliawala, M. Komeili, J. Xu, R. Raileanu, X. Li, A. Celikyilmaz, and J. Weston. Chain-of-verification reduces hallucination in large language models. arXiv preprint arXiv:2309.11495, 2023.
  • [5] J. Ge, S. Sun, J. Owens, V. Galvez, O. Gologorskaya, J. C. Lai, M. J. Pletcher, and K. Lai. Development of a liver disease-specific large language model chat interface using retrieval augmented generation. medRxiv, 2023.
  • [6] S. Ghosh, U. Tyagi, S. Kumar, and D. Manocha. BioAug: Conditional generation based data augmentation for low-resource biomedical NER. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 1853–1858, 2023.
  • [7] Z. Ji, T. Yu, Y. Xu, N. Lee, E. Ishii, and P. Fung. Towards mitigating LLM hallucination via self reflection. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 1827–1843, 2023.
  • [8] C.-Y. Lin. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, 2004.
  • [9] S. L. McNamara, P. H. Yi, and W. Lotter. The clinician-AI interface: intended use and explainability in FDA-cleared AI devices for medical image interpretation. NPJ Digital Medicine, 7(1):80, 2024.
  • [10] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, 2002.
  • [11] S. Somani, S. Balla, A. W. Peng, R. Dudum, S. Jain, K. Nasir, D. J. Maron, T. Hernandez-Boussard, and F. Rodriguez. Contemporary attitudes and beliefs on coronary artery calcium from social media using artificial intelligence. NPJ Digital Medicine, 7(1):1–6, 2024.
  • [12] M. Subbiah, S. Zhang, L. B. Chilton, and K. McKeown. Reading subtext: Evaluating large language models on short story summarization with writers. arXiv preprint arXiv:2403.01061, 2024.
  • [13] "Teknium", "theemozilla", "karan4d", and "huemin_art". Nous Hermes 2 Mistral 7B DPO.
  • [14] P. Virtanen, R. Gommers, T. E. Oliphant, M. Haberland, T. Reddy, D. Cournapeau, E. Burovski, P. Peterson, W. Weckesser, J. Bright, S. J. van der Walt, M. Brett, J. Wilson, K. J. Millman, N. Mayorov, A. R. J. Nelson, E. Jones, R. Kern, E. Larson, C. J. Carey, İ. Polat, Y. Feng, E. W. Moore, J. VanderPlas, D. Laxalde, J. Perktold, R. Cimrman, I. Henriksen, E. A. Quintero, C. R. Harris, A. M. Archibald, A. H. Ribeiro, F. Pedregosa, P. van Mulbregt, and SciPy 1.0 Contributors. SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python. Nature Methods, 17:261–272, 2020.
  • [15] G. Xiong, Q. Jin, Z. Lu, and A. Zhang. Benchmarking retrieval-augmented generation for medicine. arXiv preprint arXiv:2402.13178, 2024.

Supplementary Material

Appendix A Queries

| Query ID | LLM | Query |
|---|---|---|
| 1 | NousHermes2 | What are the side effects of xylazine? |
| 2 | NousHermes2 | What does xylazine do to the skin? |
| 3 | NousHermes2 | How does xylazine impact rehab? |
| 4 | GPT-4 | What is xylazine withdrawal like? |
| 5 | GPT-4 | What drugs contain xylazine? |
| 6 | NousHermes2 | What treatments work for xylazine? |
| 7 | NousHermes2 | What drugs are mixed or cut with xylazine? |
| 8 | NousHermes2 | What areas of the US are impacted by xylazine? |
| 9 | NousHermes2 | How is xylazine different from pure heroin? |
| 10 | NousHermes2 | What is the general sentiment associated with xylazine? |
| 11 | NousHermes2 | Does narcan or naloxone work for xylazine overdose? |
| 12 | NousHermes2 | What are the side effects of ketamine? |
| 13 | NousHermes2 | What is ketamine withdrawal like? |
| 14 | GPT-4 | What are k cramps like? |
| 15 | GPT-4 | How do the users describe k hole? |
| 16 | NousHermes2 | Does ketamine work for depression? |
| 17 | NousHermes2 | What drugs are ketamine co-used with recreationally? |
| 18 | NousHermes2 | Is ketamine effective for treatment of suicidal behavior? |
| 19 | NousHermes2 | How can you treat ketamine addiction? |
| 20 | NousHermes2 | Does ketamine use cause cramps? |

Appendix B Sample First Layer Individual Summaries

Query 1: What are the side effects of xylazine?

Individual summaries:

  • The side effects mentioned for xylazine include heavy nod, unresponsiveness, decreased breathing, change in skin color (pale, blue, gray skin/lips/nails), coma. It can also cause memory loss and very serious wounds after extended use through IV or smoking methods.

  • The side effects mentioned in the text include necrotic tissue on the septum, painful sores within the nose, a constantly swollen and blocked nose, feverish feelings, and possibly having a sinus infection.

  • There is no information in the given paragraph that specifically mentions any side effects of xylazine.

  • The text provided does not mention any side effects of xylazine.

  • The side effects mentioned include falling asleep easily, euphoria followed by coughing fits and pounding heart, causing a brown discharge when blowing nose hours later, runny nose that clogs up and dries like paint on paper towels, and leaving hard white spots on clothing.

  • The side effects of xylazine include clogging up the nose, causing dry mouth, leading to abscesses and skin ulcers (if injected), slowing wound healing, causing unexplained sores on various parts of the body regardless of method of use, making one feel lightheaded with blackout periods and minimal euphoria or other opioid-associated symptoms like itchiness. It also causes overwhelming tiredness leading to sleep instead of nodding. Xylazine is known for its damaging neurological effects as well.

Query 17: What drugs are ketamine co-used with recreationally?

Individual summaries:

  • Ketamine is co-used with MDMA, cocaine or coke when going out or having a party for recreational purposes.

  • There is no specific mention of other drugs that are co-used with ketamine in a recreational manner within the given paragraph.

  • Ketamine is co-used with LSD, Kavva, Kanna, Lean in a recreational setting.

  • Ketamine is mentioned to be used in conjunction with other controlled substances such as hallucinogenics (psilocybin mushrooms and LSD), stimulants (MDMA, MDA, cocaine) at the concert event.

Appendix C Final Summaries

Query | Final Summary
What are the side effects of xylazine? | The summaries suggest that potential side effects of xylazine include heavy nodding, unresponsiveness, decreased breathing, changes in skin color (pale, blue or gray), coma, memory loss, serious wounds from IV or smoking use, necrotic tissue on the septum, painful sores within the nose, a constantly swollen and blocked nose, feverish feelings, sinus infection, falling asleep easily, euphoria followed by coughing fits and pounding heart, brown discharge when blowing nose hours later, runny nose that clogs up and dries like paint on paper towels, hard white spots on clothing, slower heart rate, low blood pressure, reduced breathing, ulcer skin wounds, amputations, horrible acne, panic attacks 4ish hours after dosing if one doesn’t dose again, lowkey constant anxiety throughout the time while on it, severe withdrawal symptoms when trying to get off, extreme drowsiness, blurred vision, dry eyes and mouth, stumbling around, nodding heavily, almost blacking out when standing up, swelling and pain in ankle area, waking up with a headache, ringing ears, bad anxiety during and after use, less euphoria, more sleepiness, breathing issues, stuffy nose, going into blackouts, doing weird things like snapping fingers or making vocalizations while falling asleep or being high. However, some summaries state that the text does not mention any side effects of xylazine.
What does xylazine do to the skin? | Xylazine can cause various negative effects on the skin, such as necrosis (eating away at flesh), sores that may lead to sepsis, rub/itch sensation leading to raw and painful skin with scabs forming around eyes and nose, horrific wounds, rash, redness in combination with white bump lines from scratching, limb skin lesions, ulcerations, nasty, ulcerative wounds progressing to chronic osteo, damage to veins (but not directly affecting the appearance of the skin), random bruises.
How does xylazine impact rehab? | Xylazine significantly impacts individuals seeking rehab by causing hard cravings, skin necrosis, and severe withdrawal symptoms that are difficult to manage. It’s often found mixed with opioids like fentanyl, complicating overdose responses since it’s resistant to naloxone (Narcan), a common opioid overdose reversal agent. This lack of a pharmaceutical antidote for xylazine means that while naloxone can reverse opioid effects, it cannot counteract xylazine’s effects. Furthermore, xylazine contributes to the difficulty of finding resources and information for recovery, making detoxification and rehab more challenging. It also induces symptoms like horrific sickness if not used regularly, worsening withdrawal experiences, and causing unexpected side effects when mixed with other drugs. Additionally, xylazine can cause persistent psychosis or discomfort upon withdrawal, and its presence in the drug supply may cause individuals to seek help due to health concerns or worsen their desire for drug use despite rehab efforts.
What is xylazine withdrawal like? | Xylazine withdrawal seems to be characterized by symptoms such as overwhelming tiredness, black out periods, minimal euphoria or other opioid associated symptoms like itchiness, heart rate and blood pressure associated issues, difficulty with heat tolerance, unexplained sores on various parts of the body regardless of method of use, clogged nose, bad dry mouth, slow wound healing, abscesses/skin ulcers (if injected), lightheadedness, and sleep disturbances when dosing early in the day. Methadone is not effective for stopping xylazine withdrawal symptoms. Xylazine withdrawal can be unpleasant and last for a longer duration compared to fentanyl powder withdrawals. It causes both opiate and benzo withdrawal symptoms and can cause seizures as well, making it a dangerous process for those who are addicted to it. Some people experience extreme fatigue and difficulty moving during xylazine withdrawal, while others have described it as being similar to the tiredness they feel during drug withdrawals.
What drugs contain xylazine? | Drugs that may contain Xylazine include fentanyl analogues, heroin (specifically "dope"), animal sedative as a street drug in Puerto Rico, mixtures called "tranq" or "trank," a combination with fentanyl and other opioids, drugs laced with fentanyl and xylazine, ketamine-xylazine combinations, methamphetamines when cut into the drug supply, counterfeit m30 "blues", pressed benzos, and possibly heroin or similar drugs that opioid addicts have to deal with.
What treatments work for xylazine? | There is no clear information about what works to treat xylazine based solely on these summaries, as most of them state that there are no known effective treatments mentioned in the given paragraphs. Some mention potential reversal agents like yohimbine and atipamezole but only from a veterinary perspective.
What drugs are mixed or cut with xylazine? | Heroin, fentanyl (including "dirty 30s"), ketamine, counterfeit percocet and xanax pills, and other unspecified opioids ("tranq dope") may be mixed or cut with xylazine.
What areas of the US are impacted by xylazine? | Xylazine has been detected in various regions of the United States, including the northeast, midwest, and west coast. It is commonly heard about in Pennsylvania and Puerto Rico.
How is xylazine different from pure heroin? | Xylazine, also known as "tranq," is a sedative/tranquilizer commonly used for horses and is different from pure heroin in several ways. Firstly, it’s not an opiate; instead, it’s a tranquilizer that produces a somniferous effect or a “nod” which can be almost unachievable off opiates alone. This means users become physically dependent on xylazine and require its presence in their drug mix to achieve the desired effects. Additionally, unlike pure heroin, overdoses of xylazine cannot be reversed using Narcan (naloxone), an opiate antagonist that works by ripping off the opiates bound to the opioid receptors.
What is the general sentiment associated with xylazine? | The overall sentiment towards Xylazine appears to be predominantly negative or alarming due to its association with various adverse effects such as skin ulcerations, overdose deaths, necrotic lesions, and other health issues. It has been linked to an increasing number of overdose deaths nationwide, making overdose management nearly impossible since there is no known agent to reverse its effects. Xylazine’s presence in the drug supply also poses "grave threats and danger" according to Dr Raul Gupta from the White House Director of Drug Control Policy.
Does narcan or naloxone work for xylazine overdose? | Narcan (naloxone) is ineffective for xylazine overdoses because xylazine is not an opioid, and naloxone cannot reverse its effects.
What are the side effects of ketamine? | The summaries do not mention specific side effects for ketamine in most cases, but some discuss bladder issues and increased aggression in mice exposed to early life stress.
What is ketamine withdrawal like? | Ketamine withdrawal seems to involve physical symptoms such as puking, chest pain, stomach pain, runny nose, restless legs, twitching all the time and mental symptoms including difficulty sleeping, eating or hallucinating. The individual experienced intense discomfort for 5 days without sleep before managing some relief through ketamine use and a hot bath. Withdrawal symptoms persisted even after this initial period of discomfort.
What are k cramps like? | K-cramp is described as terrible cramps that feel like a burning sensation in one’s stomach, and it may also cause back pain. K cramps involve intense upper stomach pain along with acid reflux and some nausea/vomiting. They can be severe pains that start as a slight discomfort deep in the chest/sternum, similar to heartburn but heavier, reaching down into the upper abdomen and eventually creeping through the chest into the inner top part of the back. K-cramp stories involve excruciating pain, sweating, muscles seizing up, and vomiting. They can be so intense that they cause a person to feel as if they want to rip their organs out or accept death due to the unbearable pain.
How do the users describe k hole? | The users describe K-hole experiences as intense, otherworldly, and often difficult to put into words. Some mention feelings of ego death or connection with the universe, while others experience visuals like moving geometric shapes or being in strange subterranean realms. The K-hole can also be associated with sensations of floating, flying, levitation, or feeling disconnected from reality. However, some users describe their experiences as dull and not psychedelic. Some mention panic attacks, nausea, or feelings of impending doom during the experience. A few users report negative experiences such as self-injury or messy surroundings due to mobility during the hole.
Does ketamine work for depression? | Yes, according to various sources including Balanced Mental Wellness, a leading mental health provider, research papers and Dr. Ashraf Hanna, director of pain medicine at Florida Spine Institute in Clearwater, Florida, ketamine has been proven effective as an antidepressant treatment option for individuals struggling with depression or PTSD.
What drugs are ketamine co-used with recreationally? | Ketamine is reportedly co-used with MDMA, cocaine or co*ke, LSD, Kavva, Kanna, Lean and stimulants like MDA at parties. It’s also mentioned to be used in conjunction with hallucinogenics such as psilocybin mushrooms and fentanyl at concert events. However, there is no specific mention of other drugs that are co-used with ketamine recreationally within the given paragraph for some summaries.
Is ketamine effective for treatment of suicidal behavior? | Yes, according to various sources mentioned in these summaries, ketamine has been found to be an effective and fast-acting treatment for reducing or eliminating suicidal thoughts and behaviors. It is particularly useful for individuals experiencing suicidal thoughts who need immediate relief. Ketamine’s mechanism of action involves several pathways but primarily targets the glutamatergic system, which contributes to its antidepressant and anti-suicidal properties. Clinical trials have shown rapid efficacy in reducing suicidal ideation for patients suffering from major depressive disorder (MDD), bipolar depression, PTSD, cancer or other conditions.
How can you treat ketamine addiction? | There is no information in these summaries about how to treat Ketamine addiction, but they do discuss a study that used ketamine as a potential new treatment approach for behavioral addictions such as gambling disorder, internet gaming disorder, binge eating disorder and compulsive sexual behavior.
Does ketamine use cause cramps? | The summaries are inconclusive about whether ketamine causes cramps, as some say it does while others state that there is no mention of such an effect in the given paragraph.