Pupolina committed on
Commit c57baf4 · verified · 1 Parent(s): 2ba8b66

base functionality upload

Files changed (3)
  1. README.md +57 -12
  2. functionality.py +274 -0
  3. main.py +76 -0
README.md CHANGED
@@ -1,12 +1,57 @@
- ---
- title: Academic Research Assistant
- emoji: 😻
- colorFrom: yellow
- colorTo: purple
- sdk: gradio
- sdk_version: 5.5.0
- app_file: app.py
- pinned: false
- ---
-
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # AI-Powered Academic Research Assistant
+ The **AI-Powered Academic Research Assistant** is a comprehensive online chatbot designed to assist students and researchers in crafting high-quality academic work. The application leverages advanced AI technology to generate, refine, and analyze academic writing, helping users maintain formal style, grammatical precision, and logical coherence throughout their work.
+
+ ## Features
+
+ The main features of the AI-Powered Academic Research Assistant include:
+
+ - **Text Generation**: The chatbot generates contextually relevant academic text based on prompts or partial content already provided by the user, helping users build on their ideas and expand on specific topics.
+ - **Grammar Correction**: Uses a *T5 Base Grammar Correction Model* to ensure grammatical accuracy and clarity in the user's writing.
+ - **Formal Style Analysis**: Evaluates the academic tone and formal style of the text via a specialized *Style Transformer*, ensuring adherence to scholarly writing conventions.
+
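+ Internally (see `functionality.py` below), grammar correction is applied chunk by chunk rather than to the whole document at once. A minimal, self-contained sketch of that chunk-and-correct loop, with a stand-in `correct` function in place of the real T5 model:
+
+ ```python
+ import re
+
+ def split_into_chunks(text, max_tokens=40):
+     # Greedily pack whole sentences into chunks of at most max_tokens words.
+     sentences = re.split(r'(?<=[.!?])\s+', text.strip())
+     chunks, current, count = [], "", 0
+     for sentence in sentences:
+         n = len(sentence.split())
+         if count + n <= max_tokens:
+             current += sentence + ' '
+             count += n
+         else:
+             chunks.append(current.strip())
+             current, count = sentence + ' ', n
+     if current:
+         chunks.append(current.strip())
+     return chunks
+
+ def correct(chunk):
+     # Stand-in for the T5 grammar model: here we only normalize spacing.
+     return ' '.join(chunk.split())
+
+ def fix_grammar(text):
+     # Correct each chunk independently, then rejoin the results.
+     return ' '.join(correct(chunk) for chunk in split_into_chunks(text))
+ ```
+
+ Chunking keeps each model call within the T5 model's input limit; the app applies the same idea with `nltk.sent_tokenize` for sentence splitting.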
+ ## Technical Architecture
+
+ The application’s architecture is built to provide fast, reliable, high-quality academic assistance using the following components:
+
+ ### Language Models and Components
+ - [**Qwen2.5 1.5B Instruct**](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct): The main large language model (LLM) used for text generation, capable of producing nuanced and detailed academic content.
+ - [**T5 Base Grammar Correction Model**](https://huggingface.co/vennify/t5-base-grammar-correction): Ensures grammatical correctness by identifying and correcting grammar issues in the user's input text.
+ - [**Style Transformer**](https://github.com/PrithivirajDamodaran/Styleformer): Preserves and enhances formal academic style, adapting responses to align with scholarly writing conventions.
+ - **RAG (Retrieval-Augmented Generation) System**: A vector database built from a carefully curated collection of academic works, which serves as a knowledge base for the LLM. This dataset provides the model with a rich source of context and reference, ensuring responses are both accurate and relevant ([Base Dataset](https://huggingface.co/datasets/somosnlp-hackathon-2022/scientific_papers_en/viewer/default/train?row=0)).
+ - **Routing Approach**: Optimizes performance by directing each type of user query to the most suitable processing component within the application.
+
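+ The routing approach boils down to a dispatch from the user's selected goal to the matching handler. A minimal sketch with hypothetical stand-in handlers (the real handlers invoke the T5 grammar model, the Styleformer, and the Qwen LLM, respectively):
+
+ ```python
+ # Stand-in handlers; in the app these call the actual models.
+ def fix_grammar(text):
+     return f"[grammar-checked] {text}"
+
+ def fix_academic_style(text):
+     return f"[formalized] {text}"
+
+ def generate_text(text):
+     return f"[generated] {text}"
+
+ # Map each request goal (as shown in the UI) to its handler.
+ ROUTES = {
+     'Fix Grammar': fix_grammar,
+     'Fix Academic Style': fix_academic_style,
+     'Write Text (Part)': generate_text,
+ }
+
+ def route(goal, text):
+     # Fall back to generation for unrecognized goals.
+     handler = ROUTES.get(goal, generate_text)
+     return handler(text)
+ ```
+
+ The `predict` function in `functionality.py` implements this same branching with an `if`/`elif` chain over the goal selected in the Gradio UI.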
+ ### Front-End Interface
+ The front end of the application is designed for intuitive interaction, using:
+
+ - **Gradio API**: Simplifies the creation of an interactive, user-friendly front end.
+ - **Deployment on Hugging Face Spaces**: Makes the app easily accessible online as a Hugging Face demo, enabling users to experience the chatbot’s features with minimal setup.
+
+ ## Installation and Usage
+ To use the AI-Powered Academic Research Assistant, you can:
+ 1. **Try the application online** by following the [Demo link](link) to the deployed app on Hugging Face Spaces.
+
+ 2. **Run a local instance**:
+ ```bash
+ git clone https://github.com/Pupolina7/ResearchAssistant
+ pip install -r requirements.txt
+ python main.py
+ ```
+
+ ## Future Enhancements
+ To expand the functionality and user experience of the **AI-Powered Academic Research Assistant**, the following features are planned for future updates:
+
+ - **Exclude Page Jumps**: Address the current limitations in Gradio’s handling of real-time text updates within the response generation box, reducing interruptions and improving the flow of content as it’s generated. This feature will be implemented as soon as Gradio offers an update to support it.
+ - **'Stop Generation' Button**: Introduce a button that allows users to halt the text generation process mid-response, offering more control over the interaction.
+ - **File Content Writing**: Enable the assistant to write generated or improved content directly into user-uploaded files, making it easier to incorporate edits without manual copy-pasting.
+ - **Markdown Support**: Enhance the app’s ability to interpret and generate Markdown (MD) formatted content, ensuring compatibility with popular text editors and document formats.
+ - **LaTeX Support**: Add support for LaTeX, enabling researchers in fields like mathematics, physics, and engineering to input and output complex equations and scientific notation seamlessly.
+ - **Improvement Hints**: Provide contextual hints and suggestions for enhancing the clarity, coherence, or academic rigor of the user’s text, offering actionable feedback to elevate the quality of writing.
functionality.py ADDED
@@ -0,0 +1,274 @@
+ import warnings
+ from transformers import AutoTokenizer, AutoModelForCausalLM, TextIteratorStreamer
+ from happytransformer import HappyTextToText, TTSettings
+ from styleformer import Styleformer
+ from sentence_transformers import SentenceTransformer
+ import chromadb
+ import pandas as pd
+ import logging
+ import re
+ from threading import Thread
+ import hashlib
+ import diskcache as dc
+ import nltk
+ nltk.download('punkt_tab')
+
+ warnings.filterwarnings("ignore")
+ logging.basicConfig(level=logging.INFO,  # filename="py_log.log", filemode="w",
+                     format="%(asctime)s %(levelname)s %(message)s", datefmt="%Y-%m-%d %H:%M:%S")
+
+
+ # For the chromadb collection
+ MAX_TOKENS = 512
+ client = chromadb.Client()
+ embedder = SentenceTransformer('all-MiniLM-L6-v2')
+ collection_name = 'papers'
+
+ # For the grammar checker
+ happy_tt = HappyTextToText("T5", "vennify/t5-base-grammar-correction")
+ grammar_cache = dc.Cache('grammar_cache')
+
+ # For academic style checks
+ sf = Styleformer(style=0)
+ style_cache = dc.Cache('style_cache')
+
+ # For text generation
+ model_name = "Qwen/Qwen2.5-1.5B-Instruct"
+
+ model = AutoModelForCausalLM.from_pretrained(
+     model_name,
+     torch_dtype="auto",
+     device_map="auto"
+ )
+ model.generation_config.max_new_tokens = 2048
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
+ model_cache = dc.Cache('model_cache')
+
+ def generate_key(text):
+     return hashlib.md5(text.encode()).hexdigest()
+
+
+ def split_into_chunks(text, max_tokens=MAX_TOKENS):
+     # Greedily pack whole sentences into chunks of at most max_tokens words.
+     sentences = nltk.sent_tokenize(text)
+     chunks, current = [], ""
+     current_tokens = 0
+
+     for sentence in sentences:
+         sentence_tokens = len(sentence.split())
+         if current_tokens + sentence_tokens <= max_tokens:
+             current += sentence + ' '
+             current_tokens += sentence_tokens
+         else:
+             chunks.append(current.strip())
+             current, current_tokens = sentence + ' ', sentence_tokens
+     if current:
+         chunks.append(current.strip())
+     return chunks
+
+
+ def clean_text(text):
+     # Remove newlines within sentences but keep paragraph breaks
+     text = re.sub(r'\n(?!\n)', ' ', text)
+
+     # Collapse runs of newlines, keeping only double newlines for paragraphs
+     text = re.sub(r'\n{2,}', '\n\n', text)
+
+     # Rejoin hyphenated words split across lines
+     text = re.sub(r'(\w)-\s+(\w)', r'\1\2', text)
+
+     # Remove citation brackets and figure references
+     text = re.sub(r'\[\d+\]', '', text)  # Removes [7], [6], etc.
+     text = re.sub(r'Fig\.|Figure', '', text)  # Removes "Fig." or "Figure" references
+
+     # Strip leading/trailing spaces from each paragraph
+     paragraphs = text.split('\n')
+     cleaned_paragraphs = [para.strip() for para in paragraphs if para.strip()]
+
+     # Join cleaned paragraphs back with double newlines for readability
+     cleaned_text = '\n\n'.join(cleaned_paragraphs)
+
+     return cleaned_text
+
+ def get_collection() -> chromadb.Collection:
+     collection_names = [collection.name for collection in client.list_collections()]
+     logging.info(f"Client collection names: {collection_names}")
+     if collection_name not in collection_names:
+         logging.info("Creating the collection...")
+         collection = client.create_collection(name=collection_name)
+         papers = pd.read_csv("hf://datasets/somosnlp-hackathon-2022/scientific_papers_en/scientific_paper_en.csv")
+         logging.info("The data was downloaded from the Hub.")
+         papers = papers.drop(['id'], axis=1)
+         papers = papers.iloc[:200]
+
+         for i in range(200):
+             paper = papers.iloc[i]
+             idx = paper.name
+
+             full_text = clean_text('Abstract ' + paper['abstract'] + ' ' + paper['text_no_abstract'])
+             chunks = split_into_chunks(full_text)
+
+             for chunk_id, chunk in enumerate(chunks):
+                 embeddings = embedder.encode([chunk])
+                 collection.upsert(ids=f"paper{idx}_chunk_{chunk_id}",
+                                   documents=[chunk],
+                                   embeddings=embeddings)
+             logging.info(f"Collection upsert: the content of paper_{idx} was chunked and stored in the vector db!")
+
+         logging.info("Collection is filled!\n")
+     else:
+         collection = client.get_collection(name=collection_name)
+         logging.info(f"Collection '{collection_name}' already exists!")
+     return collection
+
+ def fix_grammar(text: str):
+     logging.info(f"\n---Fix Grammar input:---\n{text}")
+     key = generate_key(text)
+     if key in grammar_cache:
+         logging.info("A similar request was found in 'grammar_cache' and retrieved from it!")
+         yield grammar_cache[key]
+     else:
+         args = TTSettings(num_beams=5, min_length=1)
+         chunks = split_into_chunks(text=text, max_tokens=40)
+         corrected_text = ""
+         error_flag = False
+         for chunk in chunks:
+             try:
+                 result = happy_tt.generate_text(f"grammar: {chunk}", args=args)
+                 corrected_part = f"{result.text} "
+             except Exception as e:
+                 error_flag = True
+                 logging.error(f"Error correcting grammar: {e}")
+                 corrected_part = f"{chunk} "
+             corrected_text += corrected_part
+             yield corrected_text
+
+         if not error_flag:
+             grammar_cache.set(key, corrected_text, expire=86400)
+             logging.info("The result was cached in 'grammar_cache'!")
+
+ def fix_academic_style(informal_text: str):
+     logging.info(f"\n---Fix Academic Style input:---\n{informal_text}")
+     key = generate_key(informal_text)
+     if key in style_cache:
+         logging.info("A similar request was found in 'style_cache' and retrieved from it!")
+         yield style_cache[key]
+     else:
+         chunks = split_into_chunks(text=informal_text, max_tokens=25)
+         formal_text = ""
+         error_flag = False
+         for chunk in chunks:
+             try:
+                 corrected_part = sf.transfer(chunk)
+                 if corrected_part is None:
+                     error_flag = True
+                     corrected_part = f"{chunk} "
+                     logging.warning("---COULD NOT FIX ACADEMIC STYLE!\n")
+                 else:
+                     corrected_part = f"{corrected_part} "
+             except Exception as e:
+                 error_flag = True
+                 logging.error(f"Error in academic style transformation: {e}")
+                 corrected_part = f"{chunk} "
+             formal_text += corrected_part
+             yield formal_text
+
+         if not error_flag:
+             style_cache.set(key, formal_text, expire=86400)
+             logging.info("The result was cached in 'style_cache'!")
+
+ def _chat_stream(initial_text: str, parts: list):
+     logging.info(f"\n---Generate Article input:---\n{initial_text}")
+     parts = ", ".join(parts).lower()
+     for_cache = initial_text + ' ' + parts
+     key = generate_key(for_cache)
+     if key in model_cache:
+         logging.info("A similar request was found in 'model_cache' and retrieved from it!")
+         yield model_cache[key]
+     else:
+         text_embedding = embedder.encode([initial_text])
+         chroma_collection = get_collection()
+         results = chroma_collection.query(
+             query_embeddings=text_embedding,
+             n_results=1
+         )
+         context = results['documents'][0] if results['documents'] else ""
+         if context == "":
+             logging.warning("COLLECTION QUERY: No context was found in the database!")
+
+         messages = [
+             {"role": "system", "content": """You are a helpful Academic Research Assistant that helps to generate
+              the necessary parts of a research paper based on the provided context.
+              The context is the following: 'written text' - the text that the user
+              has so far and wants to complete, 'parts' - the parts of the paper the
+              user needs to complete (the abstract, introduction, methodology,
+              discussion, conclusion, or full text), 'context' - a similar article
+              whose structure can be used as a base for the text (it can be empty
+              if there are no similar papers in the database). The output should be
+              only the generated article (or parts of it). The response must be provided as
+              raw text. Be precise and follow the structure of academic paper parts."""},
+             {"role": "user", "content": f"'written text': {initial_text}\n 'parts': {parts}\n 'context': {context}"},
+         ]
+         input_text = tokenizer.apply_chat_template(
+             messages,
+             add_generation_prompt=True,
+             tokenize=False,
+         )
+         inputs = tokenizer([input_text], return_tensors="pt").to(model.device)
+         streamer = TextIteratorStreamer(
+             tokenizer=tokenizer, skip_prompt=True, timeout=60.0, skip_special_tokens=True
+         )
+         generation_kwargs = {
+             **inputs,
+             "streamer": streamer,
+         }
+         thread = Thread(target=model.generate, kwargs=generation_kwargs)
+         thread.start()
+
+         response = ""
+         for new_text in streamer:
+             response += new_text
+             yield response
+         model_cache.set(key, response, expire=86400)
+         logging.info("The result was cached in 'model_cache'!")
+
+ def predict(goal: str, parts: list, context: str):
+     if context == "":
+         yield "Write your text first!"
+         logging.info("No context was provided!")
+     elif goal == 'Fix Academic Style':
+         formal_text = ""
+         for new_text in fix_academic_style(context):
+             formal_text = new_text
+             yield formal_text
+
+         logging.info(f"\n---Academic style corrected:---\n {formal_text}\n")
+     elif goal == 'Fix Grammar':
+         full_response = ""
+         for new_text in fix_grammar(context):
+             full_response = new_text
+             yield full_response
+
+         logging.info(f"\n---Grammar corrected:---\n{full_response}\n")
+     else:
+         full_response = ""
+         for new_text in _chat_stream(context, parts):
+             full_response = new_text
+             yield full_response
+
+         logging.info(f"\nThe text was generated!\n{full_response}")
main.py ADDED
@@ -0,0 +1,76 @@
+ import gradio as gr
+ from functionality import get_collection, predict
+
+
+ def read_file(file):
+     if file is not None:
+         with open(file.name, 'r') as f:
+             file_data = f.read()
+         return file_data
+     else:
+         return None
+
+ with gr.Blocks() as demo:
+     gr.Markdown("# **📝 AI-powered Academic Research Assistant 📝**")
+     gr.Markdown("""The **AI-powered Academic Research Assistant** is a tool that helps to
+                 ensure *correct grammar* and *academic style* in scientific papers.
+
+                 It can also help with *writing the needed parts* or *proposing possible ideas*
+                 for describing what you want in an appropriate way.
+
+                 ## 📥 Below, choose the parameters appropriate for your goals, then wait a little for the response!""")
+
+     gr.Markdown('📨 Write the text you want to expand, or upload a corresponding text file.')
+
+     with gr.Tab('Write Text 📖'):
+         gr.Markdown("⚙️ *Hint*: for more effective 'Fix Academic Style' results, keep your sentences short (<= 20 words).")
+         input_prompt = gr.Textbox(label='Initial Text 📝',
+                                   placeholder='Write your research text here!',
+                                   lines=9)
+     with gr.Tab('Upload File 📩'):
+         gr.Markdown("⚙️ *Hint*: for more effective 'Fix Academic Style' results, keep your sentences short (<= 20 words).")
+         txt_file = gr.File(file_types=['text'], label='Upload Text File')
+         txt_file.change(read_file, inputs=txt_file, outputs=input_prompt)
+
+     gr.Markdown('✏️ Fill in the parameters for your needs')
+     with gr.Row(variant='panel', equal_height=True):
+         request_goal = gr.Radio(label='🤔 Specify the purpose of your request.',
+                                 info="Pick one:",
+                                 choices=['Write Text (Part)', 'Fix Academic Style', 'Fix Grammar'],
+                                 value='Write Text (Part)')
+
+         with gr.Accordion("❗️ If you need to Write Text (Part), choose the appropriate option!", open=False):
+             part_to_write = gr.CheckboxGroup(label="""📋 Which part should the Assistant write? (Specify
+                                              which part of your research you need to complete.)""",
+                                              info="You may choose as many as needed:",
+                                              choices=['Abstract', 'Introduction',
+                                                       'Methodology', 'Discussion', 'Conclusion', 'Full Text'],
+                                              value='Abstract')
+
+     with gr.Row(equal_height=True):
+         submit_btn = gr.Button('Confirm! ✅')
+         clear_btn = gr.Button('Clear ❌', min_width=611)
+
+     gr.Markdown('##### 📌 Assistant Response')
+     gr.Markdown("If you are not satisfied with the response, try to paraphrase!")
+
+     response = gr.Textbox(label="Generated Text 👨🏼‍💻",
+                           info="""You may face some page jumps; it is a bug that will be fixed. Just wait for the text generation to complete.
+                           Sorry for the inconvenience.""",
+                           lines=9,
+                           placeholder='The generated text will appear here!',
+                           show_label=True,
+                           show_copy_button=True,
+                           autofocus=True,
+                           autoscroll=True)
+
+     submit_btn.click(fn=predict,
+                      inputs=[request_goal, part_to_write, input_prompt],
+                      outputs=[response],
+                      scroll_to_output=True)
+     clear_btn.click(lambda: (None, None, 'Write Text (Part)', 'Abstract', None), None,
+                     outputs=[input_prompt, txt_file, request_goal, part_to_write, response])
+
+ if __name__ == "__main__":
+     get_collection()
+     demo.launch()