Pupolina committed on
Commit c57baf4 · verified · 1 Parent(s): 2ba8b66

base functionality upload

Files changed (3)
  1. README.md +57 -12
  2. functionality.py +274 -0
  3. main.py +76 -0
README.md CHANGED
@@ -1,12 +1,57 @@
- ---
- title: Academic Research Assistant
- emoji: 😻
- colorFrom: yellow
- colorTo: purple
- sdk: gradio
- sdk_version: 5.5.0
- app_file: app.py
- pinned: false
- ---
-
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # AI-Powered Academic Research Assistant
+ The **AI-Powered Academic Research Assistant** is a comprehensive online chatbot designed to assist students and researchers in crafting high-quality academic work. The application leverages advanced AI technology to generate, refine, and analyze academic writing, helping users maintain formal style, grammatical precision, and logical coherence throughout their work.
+
+ ## Features
+
+ The main features of the AI-Powered Academic Research Assistant include:
+
+ - **Text Generation**: The chatbot generates contextually relevant academic text based on prompts or partial content already provided by the user, helping users build on their ideas and expand on specific topics.
+ - **Grammar Correction**: Uses a *T5 Base Grammar Correction Model* to ensure grammatical accuracy and clarity in the user's writing.
+ - **Formal Style Analysis**: Evaluates the academic tone and formal style of the text via a specialized *Style Transformer*, ensuring adherence to scholarly writing conventions.
+
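+ Internally (see `functionality.py` below), grammar correction is applied chunk by chunk rather than to the whole document at once. A minimal, self-contained sketch of that chunk-and-correct loop, with a stand-in `correct` function in place of the real T5 model:
+
+ ```python
+ import re
+
+ def split_into_chunks(text, max_tokens=40):
+     # Greedily pack whole sentences into chunks of at most max_tokens words.
+     sentences = re.split(r'(?<=[.!?])\s+', text.strip())
+     chunks, current, count = [], "", 0
+     for sentence in sentences:
+         n = len(sentence.split())
+         if count + n <= max_tokens:
+             current += sentence + ' '
+             count += n
+         else:
+             chunks.append(current.strip())
+             current, count = sentence + ' ', n
+     if current:
+         chunks.append(current.strip())
+     return chunks
+
+ def correct(chunk):
+     # Stand-in for the T5 grammar model: here we only normalize spacing.
+     return ' '.join(chunk.split())
+
+ def fix_grammar(text):
+     # Correct each chunk independently, then rejoin the results.
+     return ' '.join(correct(chunk) for chunk in split_into_chunks(text))
+ ```
+
+ Chunking keeps each model call within the T5 model's input limit; the app applies the same idea with `nltk.sent_tokenize` for sentence splitting.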
+ ## Technical Architecture
+
+ The application’s architecture is built to provide fast, reliable, high-quality academic assistance using the following components:
+
+ ### Language Models and Components
+ - [**Qwen2.5 1.5B Instruct**](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct): The main large language model (LLM) used for text generation, capable of producing nuanced and detailed academic content.
+ - [**T5 Base Grammar Correction Model**](https://huggingface.co/vennify/t5-base-grammar-correction): Ensures grammatical correctness by identifying and correcting grammar issues in the user's input text.
+ - [**Style Transformer**](https://github.com/PrithivirajDamodaran/Styleformer): Preserves and enhances formal academic style, adapting responses to align with scholarly writing conventions.
+ - **RAG (Retrieval-Augmented Generation) System**: A vector database built from a carefully curated collection of academic works, which serves as a knowledge base for the LLM. This dataset provides the model with a rich source of context and reference, ensuring responses are both accurate and relevant ([Base Dataset](https://huggingface.co/datasets/somosnlp-hackathon-2022/scientific_papers_en/viewer/default/train?row=0)).
+ - **Routing Approach**: Optimizes performance by directing each type of user query to the most suitable processing component within the application.
+
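+ The routing approach boils down to a dispatch from the user's selected goal to the matching handler. A minimal sketch with hypothetical stand-in handlers (the real handlers invoke the T5 grammar model, the Styleformer, and the Qwen LLM, respectively):
+
+ ```python
+ # Stand-in handlers; in the app these call the actual models.
+ def fix_grammar(text):
+     return f"[grammar-checked] {text}"
+
+ def fix_academic_style(text):
+     return f"[formalized] {text}"
+
+ def generate_text(text):
+     return f"[generated] {text}"
+
+ # Map each request goal (as shown in the UI) to its handler.
+ ROUTES = {
+     'Fix Grammar': fix_grammar,
+     'Fix Academic Style': fix_academic_style,
+     'Write Text (Part)': generate_text,
+ }
+
+ def route(goal, text):
+     # Fall back to generation for unrecognized goals.
+     handler = ROUTES.get(goal, generate_text)
+     return handler(text)
+ ```
+
+ The `predict` function in `functionality.py` implements this same branching with an `if`/`elif` chain over the goal selected in the Gradio UI.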
+ ### Front-End Interface
+ The front end of the application is designed for intuitive interaction, using:
+
+ - **Gradio API**: Simplifies the creation of an interactive, user-friendly front end.
+ - **Deployment on Hugging Face Spaces**: Makes the app easily accessible online as a Hugging Face demo, enabling users to experience the chatbot’s features with minimal setup.
+
+ ## Installation and Usage
+ To use the AI-Powered Academic Research Assistant, you can:
+ 1. **Try the application online** by following the [Demo link](link) to the deployed app on Hugging Face Spaces.
+
+ 2. **Run a local instance**:
+ ```bash
+ git clone https://github.com/Pupolina7/ResearchAssistant
+ pip install -r requirements.txt
+ python main.py
+ ```
+
+ ## Future Enhancements
+ To expand the functionality and user experience of the **AI-Powered Academic Research Assistant**, the following features are planned for future updates:
+
+ - **Exclude Page Jumps**: Address the current limitations in Gradio’s handling of real-time text updates within the response generation box, reducing interruptions and improving the flow of content as it’s generated. This feature will be implemented as soon as Gradio offers an update to support it.
+ - **'Stop Generation' Button**: Introduce a button that allows users to halt the text generation process mid-response, offering more control over the interaction.
+ - **File Content Writing**: Enable the assistant to write generated or improved content directly into user-uploaded files, making it easier to incorporate edits without manual copy-pasting.
+ - **Markdown Support**: Enhance the app’s ability to interpret and generate Markdown (MD) formatted content, ensuring compatibility with popular text editors and document formats.
+ - **LaTeX Support**: Add support for LaTeX, enabling researchers in fields like mathematics, physics, and engineering to input and output complex equations and scientific notation seamlessly.
+ - **Improvement Hints**: Provide contextual hints and suggestions for enhancing the clarity, coherence, or academic rigor of the user’s text, offering actionable feedback to elevate the quality of writing.
functionality.py ADDED
@@ -0,0 +1,274 @@
+ import warnings
+ from transformers import AutoTokenizer, AutoModelForCausalLM, TextIteratorStreamer
+ from happytransformer import HappyTextToText, TTSettings
+ from styleformer import Styleformer
+ from sentence_transformers import SentenceTransformer
+ import chromadb
+ import pandas as pd
+ import logging
+ import re
+ from threading import Thread
+ import hashlib
+ import diskcache as dc
+ import nltk
+ nltk.download('punkt_tab')
+
+ warnings.filterwarnings("ignore")
+ logging.basicConfig(level=logging.INFO,  # filename="py_log.log", filemode="w",
+                     format="%(asctime)s %(levelname)s %(message)s", datefmt="%Y-%m-%d %H:%M:%S")
+
+
+ # For the chromadb collection
+ MAX_TOKENS = 512
+ client = chromadb.Client()
+ embedder = SentenceTransformer('all-MiniLM-L6-v2')
+ collection_name = 'papers'
+
+ # For the grammar checker
+ happy_tt = HappyTextToText("T5", "vennify/t5-base-grammar-correction")
+ grammar_cache = dc.Cache('grammar_cache')
+
+ # For academic style checks
+ sf = Styleformer(style=0)
+ style_cache = dc.Cache('style_cache')
+
+ # For text generation
+ model_name = "Qwen/Qwen2.5-1.5B-Instruct"
+
+ model = AutoModelForCausalLM.from_pretrained(
+     model_name,
+     torch_dtype="auto",
+     device_map="auto"
+ )
+ model.generation_config.max_new_tokens = 2048
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
+ model_cache = dc.Cache('model_cache')
+
+ def generate_key(text):
+     return hashlib.md5(text.encode()).hexdigest()
+
+
+ def split_into_chunks(text, max_tokens=MAX_TOKENS):
+     # Greedily pack whole sentences into chunks of at most max_tokens words.
+     sentences = nltk.sent_tokenize(text)
+     chunks, current = [], ""
+     current_tokens = 0
+
+     for sentence in sentences:
+         sentence_tokens = len(sentence.split())
+         if current_tokens + sentence_tokens <= max_tokens:
+             current += sentence + ' '
+             current_tokens += sentence_tokens
+         else:
+             chunks.append(current.strip())
+             current, current_tokens = sentence + ' ', sentence_tokens
+     if current:
+         chunks.append(current.strip())
+     return chunks
+
+
+ def clean_text(text):
+     # Remove newlines within sentences but keep paragraph breaks
+     text = re.sub(r'\n(?!\n)', ' ', text)
+
+     # Collapse runs of newlines, keeping only double newlines for paragraphs
+     text = re.sub(r'\n{2,}', '\n\n', text)
+
+     # Rejoin hyphenated words split across lines
+     text = re.sub(r'(\w)-\s+(\w)', r'\1\2', text)
+
+     # Remove citation brackets and figure references
+     text = re.sub(r'\[\d+\]', '', text)  # Removes [7], [6], etc.
+     text = re.sub(r'Fig\.|Figure', '', text)  # Removes "Fig." or "Figure" references
+
+     # Strip leading/trailing spaces from each paragraph
+     paragraphs = text.split('\n')
+     cleaned_paragraphs = [para.strip() for para in paragraphs if para.strip()]
+
+     # Join cleaned paragraphs back with double newlines for readability
+     cleaned_text = '\n\n'.join(cleaned_paragraphs)
+
+     return cleaned_text
+
+ def get_collection() -> chromadb.Collection:
+     collection_names = [collection.name for collection in client.list_collections()]
+     logging.info(f"Client collection names: {collection_names}")
+     if collection_name not in collection_names:
+         logging.info("Creating the collection...")
+         collection = client.create_collection(name=collection_name)
+         papers = pd.read_csv("hf://datasets/somosnlp-hackathon-2022/scientific_papers_en/scientific_paper_en.csv")
+         logging.info("The data was downloaded from the Hub.")
+         papers = papers.drop(['id'], axis=1)
+         papers = papers.iloc[:200]
+
+         for i in range(200):
+             paper = papers.iloc[i]
+             idx = paper.name
+
+             full_text = clean_text('Abstract ' + paper['abstract'] + ' ' + paper['text_no_abstract'])
+             chunks = split_into_chunks(full_text)
+
+             for chunk_id, chunk in enumerate(chunks):
+                 embeddings = embedder.encode([chunk])
+                 collection.upsert(ids=f"paper{idx}_chunk_{chunk_id}",
+                                   documents=[chunk],
+                                   embeddings=embeddings)
+             logging.info(f"Collection upsert: the content of paper_{idx} was chunked and stored in the vector db!")
+
+         logging.info("Collection is filled!\n")
+     else:
+         collection = client.get_collection(name=collection_name)
+         logging.info(f"Collection '{collection_name}' already exists!")
+     return collection
+
+ def fix_grammar(text: str):
+     logging.info(f"\n---Fix Grammar input:---\n{text}")
+     key = generate_key(text)
+     if key in grammar_cache:
+         logging.info("A similar request was found in 'grammar_cache' and retrieved from it!")
+         yield grammar_cache[key]
+     else:
+         args = TTSettings(num_beams=5, min_length=1)
+         chunks = split_into_chunks(text=text, max_tokens=40)
+         corrected_text = ""
+         error_flag = False
+         for chunk in chunks:
+             try:
+                 result = happy_tt.generate_text(f"grammar: {chunk}", args=args)
+                 corrected_part = f"{result.text} "
+             except Exception as e:
+                 error_flag = True
+                 logging.error(f"Error correcting grammar: {e}")
+                 corrected_part = f"{chunk} "
+             corrected_text += corrected_part
+             yield corrected_text
+
+         if not error_flag:
+             grammar_cache.set(key, corrected_text, expire=86400)
+             logging.info("The result was cached in 'grammar_cache'!")
+
+ def fix_academic_style(informal_text: str):
+     logging.info(f"\n---Fix Academic Style input:---\n{informal_text}")
+     key = generate_key(informal_text)
+     if key in style_cache:
+         logging.info("A similar request was found in 'style_cache' and retrieved from it!")
+         yield style_cache[key]
+     else:
+         chunks = split_into_chunks(text=informal_text, max_tokens=25)
+         formal_text = ""
+         error_flag = False
+         for chunk in chunks:
+             try:
+                 corrected_part = sf.transfer(chunk)
+                 if corrected_part is None:
+                     error_flag = True
+                     corrected_part = f"{chunk} "
+                     logging.warning("---COULD NOT FIX ACADEMIC STYLE!\n")
+                 else:
+                     corrected_part = f"{corrected_part} "
+             except Exception as e:
+                 error_flag = True
+                 logging.error(f"Error in academic style transformation: {e}")
+                 corrected_part = f"{chunk} "
+             formal_text += corrected_part
+             yield formal_text
+
+         if not error_flag:
+             style_cache.set(key, formal_text, expire=86400)
+             logging.info("The result was cached in 'style_cache'!")
+
+ def _chat_stream(initial_text: str, parts: list):
+     logging.info(f"\n---Generate Article input:---\n{initial_text}")
+     parts = ", ".join(parts).lower()
+     for_cache = initial_text + ' ' + parts
+     key = generate_key(for_cache)
+     if key in model_cache:
+         logging.info("A similar request was found in 'model_cache' and retrieved from it!")
+         yield model_cache[key]
+     else:
+         text_embedding = embedder.encode([initial_text])
+         chroma_collection = get_collection()
+         results = chroma_collection.query(
+             query_embeddings=text_embedding,
+             n_results=1
+         )
+         context = results['documents'][0] if results['documents'] else ""
+         if context == "":
+             logging.warning("COLLECTION QUERY: No context was found in the database!")
+
+         messages = [
+             {"role": "system", "content": """You are a helpful Academic Research Assistant that helps to generate
+              the necessary parts of a research paper based on the provided context.
+              The context is the following: 'written text' - the text that the user
+              has so far and wants to complete, 'parts' - the parts of the paper the
+              user needs to complete (the abstract, introduction, methodology,
+              discussion, conclusion, or full text), 'context' - a similar article
+              whose structure can be used as a base for the text (it can be empty
+              if there are no similar papers in the database). The output should be
+              only the generated article (or parts of it). The response must be provided as
+              raw text. Be precise and follow the structure of academic paper parts."""},
+             {"role": "user", "content": f"'written text': {initial_text}\n 'parts': {parts}\n 'context': {context}"},
+         ]
+         input_text = tokenizer.apply_chat_template(
+             messages,
+             add_generation_prompt=True,
+             tokenize=False,
+         )
+         inputs = tokenizer([input_text], return_tensors="pt").to(model.device)
+         streamer = TextIteratorStreamer(
+             tokenizer=tokenizer, skip_prompt=True, timeout=60.0, skip_special_tokens=True
+         )
+         generation_kwargs = {
+             **inputs,
+             "streamer": streamer,
+         }
+         thread = Thread(target=model.generate, kwargs=generation_kwargs)
+         thread.start()
+
+         response = ""
+         for new_text in streamer:
+             response += new_text
+             yield response
+         model_cache.set(key, response, expire=86400)
+         logging.info("The result was cached in 'model_cache'!")
+
+ def predict(goal: str, parts: list, context: str):
+     if context == "":
+         yield "Write your text first!"
+         logging.info("No context was provided!")
+     elif goal == 'Fix Academic Style':
+         formal_text = ""
+         for new_text in fix_academic_style(context):
+             formal_text = new_text
+             yield formal_text
+
+         logging.info(f"\n---Academic style corrected:---\n {formal_text}\n")
+     elif goal == 'Fix Grammar':
+         full_response = ""
+         for new_text in fix_grammar(context):
+             full_response = new_text
+             yield full_response
+
+         logging.info(f"\n---Grammar corrected:---\n{full_response}\n")
+     else:
+         full_response = ""
+         for new_text in _chat_stream(context, parts):
+             full_response = new_text
+             yield full_response
+
+         logging.info(f"\nThe text was generated!\n{full_response}")
main.py ADDED
@@ -0,0 +1,76 @@
+ import gradio as gr
+ from functionality import get_collection, predict
+
+
+ def read_file(file):
+     if file is not None:
+         with open(file.name, 'r') as f:
+             file_data = f.read()
+         return file_data
+     else:
+         return None
+
+ with gr.Blocks() as demo:
+     gr.Markdown("# **📝 AI-powered Academic Research Assistant 📝**")
+     gr.Markdown("""The **AI-powered Academic Research Assistant** is a tool that helps to
+                 ensure *correct grammar* and *academic style* in scientific papers.
+
+                 It can also help with *writing the needed parts* or *proposing possible ideas*
+                 for describing what you want in an appropriate way.
+
+                 ## 📥 Below, choose the parameters appropriate for your goals, then wait a little for the response!""")
+
+     gr.Markdown('📨 Write the text you want to expand, or upload a corresponding text file.')
+
+     with gr.Tab('Write Text 📖'):
+         gr.Markdown("⚙️ *Hint*: for more effective 'Fix Academic Style' results, keep your sentences short (<= 20 words).")
+         input_prompt = gr.Textbox(label='Initial Text 📝',
+                                   placeholder='Write your research text here!',
+                                   lines=9)
+     with gr.Tab('Upload File 📩'):
+         gr.Markdown("⚙️ *Hint*: for more effective 'Fix Academic Style' results, keep your sentences short (<= 20 words).")
+         txt_file = gr.File(file_types=['text'], label='Upload Text File')
+         txt_file.change(read_file, inputs=txt_file, outputs=input_prompt)
+
+     gr.Markdown('✏️ Fill in the parameters for your needs')
+     with gr.Row(variant='panel', equal_height=True):
+         request_goal = gr.Radio(label='🤔 Specify the purpose of your request.',
+                                 info="Pick one:",
+                                 choices=['Write Text (Part)', 'Fix Academic Style', 'Fix Grammar'],
+                                 value='Write Text (Part)')
+
+         with gr.Accordion("❗️ If you need to Write Text (Part), choose the appropriate option!", open=False):
+             part_to_write = gr.CheckboxGroup(label="""📋 Which part should the Assistant write? (Specify
+                                              which part of your research you need to complete.)""",
+                                              info="You may choose as many as needed:",
+                                              choices=['Abstract', 'Introduction',
+                                                       'Methodology', 'Discussion', 'Conclusion', 'Full Text'],
+                                              value='Abstract')
+
+     with gr.Row(equal_height=True):
+         submit_btn = gr.Button('Confirm! ✅')
+         clear_btn = gr.Button('Clear ❌', min_width=611)
+
+     gr.Markdown('##### 📌 Assistant Response')
+     gr.Markdown("If you are not satisfied with the response, try to paraphrase!")
+
+     response = gr.Textbox(label="Generated Text 👨🏼‍💻",
+                           info="""You may face some page jumps; it is a bug that will be fixed. Just wait for the text generation to complete.
+                           Sorry for the inconvenience.""",
+                           lines=9,
+                           placeholder='The generated text will appear here!',
+                           show_label=True,
+                           show_copy_button=True,
+                           autofocus=True,
+                           autoscroll=True)
+
+     submit_btn.click(fn=predict,
+                      inputs=[request_goal, part_to_write, input_prompt],
+                      outputs=[response],
+                      scroll_to_output=True)
+     clear_btn.click(lambda: (None, None, 'Write Text (Part)', 'Abstract', None), None,
+                     outputs=[input_prompt, txt_file, request_goal, part_to_write, response])
+
+ if __name__ == "__main__":
+     get_collection()
+     demo.launch()