Structure

1. Data: create a dataset or download an existing one, then read the data
2. Top2Vec model training
3. Working with Top2Vec features
4. Visualization
5. Gradio interface

In [None]:
#@title Installing necessary dependencies
%%capture
!pip install arxivscraper
!pip install top2vec
!pip install top2vec[sentence_encoders]
!pip install tensorflow==2.8.0
!pip install tensorflow-probability==0.16.0
!pip install gradio

## Data

Two options are available:
1. Create a new dataset.
2. Use an existing dataset of arXiv articles in the **Computer Science** (CS) category, spanning **2010 to 2023**.

In [None]:
#@title Create dataset

# Everything in this cell is commented out; uncomment it to create the dataset yourself.

# # Extracting and Processing arXiv Data
# import arxivscraper
# import pandas as pd

# def scrape_and_save(category, start_year, end_year):
#     """Scrape arXiv data for a given category and range of years."""
#     for year in range(start_year, end_year):
#         scraper = arxivscraper.Scraper(category=category,
#                                        date_from=f'{year}-01-01',
#                                        date_until=f'{year+1}-01-01')
#         df = pd.DataFrame(scraper.scrape())
#         df.to_csv(f'arxiv_{category}_{year}.csv', index=False)
#         print(f'Data for {year} saved.')

# def combine_and_process(file_names):
#     """Combine multiple CSV files into a single DataFrame and process the data."""
#     text_columns = ['title', 'abstract', 'categories', 'doi', 'authors', 'url']
#     date_columns = ['created', 'updated']
#     df_list = []
#     for file_name in file_names:
#         df_temp = pd.read_csv(file_name, dtype={'id': str}, low_memory=False)
#         # Replace missing text with 'None' before casting; astype(str) alone would turn NaN into 'nan'
#         df_temp[text_columns] = df_temp[text_columns].fillna('None').astype(str)
#         # Convert each date column to datetime, with invalid dates set to NaT
#         for col in date_columns:
#             df_temp[col] = pd.to_datetime(df_temp[col], errors='coerce')
#         df_list.append(df_temp)
#     # Combine all DataFrames into one
#     return pd.concat(df_list, ignore_index=True)

# # Scrape data for the 'cs' category from 2010 to 2023
# scrape_and_save(category='cs', start_year=2010, end_year=2024)

# # Combine and process the scraped data
# file_names = [f'arxiv_cs_{year}.csv' for year in range(2010, 2024)]
# df_combined = combine_and_process(file_names)

# # Save the combined DataFrame as a Parquet file
# try:
#     df_combined.to_parquet('combined_data.parquet', index=False)
#     print("File successfully saved in Parquet format.")
# except Exception as e:
#     print(f"Error saving file: {e}")


In [None]:
#@title Downloading existing dataset
%%capture
!wget https://huggingface.co/datasets/CCRss/arxiv_papers_cs/resolve/main/arxiv_cs_from2010to2024-01-01.parquet

In [None]:
#@title Read data
import pandas as pd

file_name = '/content/arxiv_cs_from2010to2024-01-01.parquet' # specifying path
df = pd.read_parquet(file_name)

print(df.shape)
df.head()

(555563, 11)


Unnamed: 0,title,id,abstract,categories,doi,created,updated,authors,url,abstract_length,id_n
0,on-line viterbi algorithm and its relationship...,704.0062,"in this paper, we introduce the on-line viterb...",cs.ds,10.1007/978-3-540-74126-8_23,2007-03-31,NaT,"['šrámek', 'brejová', 'vinař']",https://arxiv.org/abs/0704.0062,711,0
1,capacity of a multiple-antenna fading channel ...,704.0217,given a multiple-input multiple-output (mimo) ...,cs.it math.it,10.1109/tit.2008.2011437,2007-04-02,2009-02-16,"['santipach', 'honig']",https://arxiv.org/abs/0704.0217,1641,1
2,refuting the pseudo attack on the reesse1+ cry...,704.0492,we illustrate through example 1 and 2 that the...,cs.cr,,2007-04-04,2010-02-04,"['su', 'lu']",https://arxiv.org/abs/0704.0492,1345,2
3,optimal routing for decode-and-forward based c...,704.0499,we investigate cooperative wireless relay netw...,cs.it math.it,10.1109/sahcn.2007.4292845,2007-04-04,NaT,"['ong', 'motani']",https://arxiv.org/abs/0704.0499,1164,3
4,on the kolmogorov-chaitin complexity for short...,704.1043,a drawback of kolmogorov-chaitin complexity (k...,cs.cc cs.it math.it,,2007-04-08,2010-12-16,"['delahaye', 'zenil']",https://arxiv.org/abs/0704.1043,861,4


## OpenAI API (TODO)


https://platform.openai.com/docs/guides/text-generation

In [None]:
# %%capture
# !pip install openai

In [None]:
# # Read the CSV file into a DataFrame
# df_open = pd.read_csv('/content/UAVs&ecology_keywords - Sheet1.csv')

In [None]:
# import pandas as pd
# from openai import OpenAI

# # OpenAI API setup
# client = OpenAI(api_key=' ')
# #TODO make instruction more strict
# instruction_for_topics = """
#       Instruction for Generating Topic Keywords:
#       Analyze the provided keywords from documents within the topic.
#       Identify the most representative and recurring keywords.
#       Exclude any variations of 'UAV', 'unmanned aerial vehicle', or 'drones'.
#       For multi-word keywords, split them into single words that are commonly recognized in the field.
#       Select the top five keywords that best capture the essence of the topic, ensuring they are single words.
#       Format Keywords: List the selected keywords in this format: ["keyword1", "keyword2", "keyword3", "keyword4", "keyword5"].
#       Provide Keywords Only: Respond with the formatted list of keywords, without any additional sentences or explanations.
# """
# # Generate keywords for each topic
# topic_keywords = {}
# for topic in df_open['topic'].unique():
#     combined_keywords = '; '.join(df_open[df_open['topic'] == topic]['keywords'])
#     completion = client.chat.completions.create(
#         model="gpt-3.5-turbo",
#         seed=5,
#         temperature=0.1,
#         max_tokens=100,
#         messages=[
#             {
#                 "role": "user",
#                 "content": instruction_for_topics + combined_keywords,
#             },
#         ],
#     )
#     topic_keywords[topic] = completion.choices[0].message.content

# print(topic_keywords)

In [None]:
# for topic, keywords in topic_keywords.items():
#     df_open.loc[df_open['topic'] == topic, 'generated_keywords'] = keywords

# # Now 'df_open' has a new column 'generated_keywords' with the generated keywords for each topic


In [None]:
# df_open.tail()

In [None]:
# df_open.to_csv('/content/UAVs&ecology_keywords_updated.csv', index=False)

## Top2Vec

### Model training

In [None]:
# ### Everything below is commented out because the model is already trained and can be downloaded from Hugging Face
# #@title Specifying model parameters
# from top2vec import Top2Vec
# # Build the list of abstract strings to pass to Top2Vec
# docs = df.abstract.tolist()

# model = Top2Vec(
#     documents=docs,
#     speed='learn',
#     workers=80,
#     embedding_model='universal-sentence-encoder',
#     umap_args={'n_neighbors': 15,
#                'n_components': 5,
#                'metric': 'cosine',
#                'min_dist': 0.0,
#                'random_state': 42},
#     hdbscan_args={'min_cluster_size': 15,
#                   'metric': 'euclidean',
#                   'cluster_selection_method': 'eom'}
# )

In [None]:
#@title Save model
# model.save('arxiv_cs_from2010to2024-01-01')

### Model initialization and assigning Document Topics


In [None]:
#@title Downloading trained top2vec model from Hugging Face
%%capture
!wget https://huggingface.co/CCRss/topic_modeling_top2vec_scientific-texts/resolve/main/top2vec_model_arxiv_cs_from2010to2024-01-01

In [None]:
#@title Load model
from top2vec import Top2Vec
model = Top2Vec.load("/content/top2vec_model_arxiv_cs_from2010to2024-01-01")

In [None]:
#@title Assigning Document Topics and Creating a Sorted DataFrame
# Get topic sizes and numbers
topic_sizes, topic_nums = model.get_topic_sizes()

# Initialize an empty list for results
data = []

# Iterate over each topic
for topic_num in topic_nums:
    # Get documents belonging to the topic
    _, _, document_ids = model.search_documents_by_topic(topic_num=topic_num, num_docs=topic_sizes[topic_num])

    # Add document IDs and topic number to the list
    for doc_id in document_ids:
        data.append({'document_id': doc_id, 'topic_num': topic_num})

# Create a DataFrame from the list
df_new = pd.DataFrame(data)

# Sort the new DataFrame by document_id
df_new = df_new.sort_values(by='document_id').reset_index(drop=True)

# Assign topic numbers to the original DataFrame
df['topic_num'] = df_new['topic_num']
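
The cell above relies on the sorted document IDs lining up with the DataFrame's row positions, which holds when the model was trained without custom document IDs. As a hedged alternative, the sketch below aligns the assignments on the index explicitly, so a missing or reordered ID shows up as NaN instead of silently shifting rows. The names `df_toy` and `assignments` are hypothetical stand-ins, not objects from this notebook.

```python
import pandas as pd

# Toy stand-in for the articles DataFrame (hypothetical data).
df_toy = pd.DataFrame({"abstract": ["a", "b", "c", "d"]})

# (document_id, topic_num) pairs in the arbitrary order a per-topic search might return.
assignments = [(2, 0), (0, 0), (3, 1), (1, 2)]

# Build a Series keyed by document ID, then align it on the DataFrame index.
doc_to_topic = pd.Series(dict(assignments), name="topic_num")
df_toy["topic_num"] = doc_to_topic.reindex(df_toy.index)

# A NaN here would mean some row never received a topic.
assert df_toy["topic_num"].notna().all()
print(df_toy)
```

Aligning by label rather than by position costs nothing here and makes the assumption about document IDs checkable.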

In [None]:
#@title Get topic representations.
topic_words, word_scores, topic_nums = model.get_topics()

# Build a short label for a topic from its five highest-ranked words
def create_topic_representation(words):
    return ', '.join(words[:5])

# Create a list of topic representations
topic_representations = [create_topic_representation(words) for words in topic_words]

# Build a dictionary mapping each topic number to its representation
topic_representation_dict = dict(zip(topic_nums, topic_representations))

# Map topic numbers to representations in the DataFrame
df['topic_representation'] = df['topic_num'].map(topic_representation_dict)

### Identification and evaluation of thematic groups

In [None]:
#@title Check the number of documents per topic
topic_sizes, topic_nums = model.get_topic_sizes()


print("The amount of docs in every topic:")
print("-" * 30)
for num, size in zip(topic_nums, topic_sizes):
    print(f"Topic {num}: {size} docs")

The amount of docs in every topic:
------------------------------
Topic 0: 40891 docs
Topic 1: 9478 docs
Topic 2: 9250 docs
Topic 3: 9052 docs
Topic 4: 7547 docs
Topic 5: 7365 docs
Topic 6: 6434 docs
Topic 7: 5701 docs
Topic 8: 5652 docs
Topic 9: 5367 docs
Topic 10: 5304 docs
Topic 11: 5127 docs
Topic 12: 4999 docs
Topic 13: 4830 docs
Topic 14: 4782 docs
Topic 15: 4269 docs
Topic 16: 3997 docs
Topic 17: 3994 docs
Topic 18: 3866 docs
Topic 19: 3817 docs
Topic 20: 3693 docs
Topic 21: 3637 docs
Topic 22: 3537 docs
Topic 23: 3430 docs
Topic 24: 3365 docs
Topic 25: 3150 docs
Topic 26: 3133 docs
Topic 27: 3115 docs
Topic 28: 3033 docs
Topic 29: 3026 docs
Topic 30: 3006 docs
Topic 31: 2990 docs
Topic 32: 2969 docs
Topic 33: 2939 docs
Topic 34: 2916 docs
Topic 35: 2879 docs
Topic 36: 2745 docs
Topic 37: 2677 docs
Topic 38: 2660 docs
Topic 39: 2645 docs
Topic 40: 2627 docs
Topic 41: 2591 docs
Topic 42: 2546 docs
Topic 43: 2508 docs
Topic 44: 2482 docs
Topic 45: 2472 docs
Topic 46: 2452 docs
Top

The code below uses the Top2Vec model to search for topics related to precision agriculture and UAVs. Keywords such as "uav", "precision", "agriculture", and "crop" are passed to `search_topics`, which returns the 20 topics that best match them. For each topic, the output lists the topic number, its relevance score, and its top five words with their similarity scores to the topic. This shows what each topic focuses on and how closely it relates to the keywords.

In [None]:
#@title Get topic Details

def print_search_topics_details(topic_words, word_scores, topic_scores, topic_nums):
    """
    Function to print details of topics found by keywords and return a list of their topic numbers.
    :param topic_words: List of words for each topic.
    :param word_scores: List of word similarity scores with topics.
    :param topic_scores: Relevance scores of topics.
    :param topic_nums: Unique indexes of topics.
    :return: List of unique indexes of found topics.
    """
    relevant_topic_nums = []  # List to save numbers of relevant topics
    num_topics = len(topic_nums)
    for i in range(num_topics):
        print(f"Topic #{topic_nums[i]} (Score: {topic_scores[i]:.2f}):")
        print("-" * 50)
        for word, score in zip(topic_words[i][:5], word_scores[i][:5]):
            print(f"{word}: {score:.2f}")
        print("-" * 50)
        relevant_topic_nums.append(topic_nums[i])  # Add topic number to the list
    return relevant_topic_nums

# Keywords to search for topics
keywords = ["uav", "precision", "agriculture", "crop", "water",
            "farming", "landscapes", "land", "monitoring", "mapping"]
# Search for topics by keywords
topic_words, word_scores, topic_scores, topic_nums = model.search_topics(keywords=keywords, num_topics=20)

# Get and print the list of relevant topics
relevant_topics = print_search_topics_details(topic_words, word_scores, topic_scores, topic_nums)

# Print the list of numbers of relevant topics
print("List of 20 relevant topics:", relevant_topics)

Topic #245 (Score: 0.18):
--------------------------------------------------
clustering: 0.28
crops: 0.28
harvesting: 0.26
crop: 0.26
predictors: 0.25
--------------------------------------------------
Topic #201 (Score: 0.17):
--------------------------------------------------
landsat: 0.37
clustering: 0.37
dataset: 0.34
datasets: 0.33
triangulation: 0.33
--------------------------------------------------
Topic #285 (Score: 0.16):
--------------------------------------------------
drones: 0.52
uavs: 0.49
drone: 0.46
quadcopters: 0.45
quadcopter: 0.44
--------------------------------------------------
Topic #497 (Score: 0.16):
--------------------------------------------------
deforestation: 0.33
forests: 0.32
landsat: 0.32
trees: 0.29
clustering: 0.27
--------------------------------------------------
Topic #286 (Score: 0.15):
--------------------------------------------------
triangulation: 0.40
voronoi: 0.37
geospatial: 0.34
roadways: 0.34
polyline: 0.33
----------------------------
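
To make the relevance scores easier to interpret, here is a simplified sketch of the idea behind a keyword search over topics: average the keyword vectors into one query vector, then rank topics by cosine similarity to it. This is a conceptual illustration only, not Top2Vec's actual implementation; the three-dimensional vectors and the names `word_vectors`, `topic_vectors`, and `rank_topics` are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Hypothetical 3-d embeddings; real models use hundreds of dimensions.
word_vectors = {
    "uav": [0.9, 0.1, 0.0],
    "crop": [0.1, 0.9, 0.0],
    "cipher": [0.0, 0.1, 0.9],
}
topic_vectors = {
    0: [0.8, 0.3, 0.0],   # drone-like topic
    1: [0.2, 0.8, 0.1],   # agriculture-like topic
    2: [0.0, 0.0, 1.0],   # cryptography-like topic
}

def rank_topics(keywords, num_topics):
    """Return (topic_num, score) pairs for the topics closest to the keywords."""
    vecs = [word_vectors[k] for k in keywords]
    # Average the keyword vectors component-wise to form one query vector.
    query = [sum(component) / len(vecs) for component in zip(*vecs)]
    scored = sorted(
        ((cosine(query, vec), num) for num, vec in topic_vectors.items()),
        reverse=True,
    )
    return [(num, score) for score, num in scored[:num_topics]]

ranked = rank_topics(["uav", "crop"], num_topics=2)
for num, score in ranked:
    print(f"Topic #{num} (Score: {score:.2f})")
```

Under this view, a low score such as 0.18 suggests that no topic sits close to the combined query, so the returned list may contain topics that are only loosely related to the keywords.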

In [None]:
#@title Our six UAV topic groups in ecology and emergencies

# Natural disasters and emergencies = "natural", "disasters", "relief", "modelling","atmospheric", "emergency", "earthquake", "rescue"
# Water pollution = "remote", "sensing", "water", "pollution", "quality", "monitoring", "investigation", "machinelearning"
# Air pollution = "gas", "pollution", "air", "sensors", "environmental", "monitoring", "optical", "atmospheric","emissions"
# Household waste = "city", "dump", "emissions", "safety", "remote", "sensing", "garbage","household", "waste", "recycling"
# Infrared and thermal mapping = "population", "aerial", "surveys", "thermal", "remote","sensing", "infrared","mapping"
# Agricultural Mapping and Surveying = "precision", "agriculture", "crop", "water","farming", "landscapes","land","monitoring","mapping"

# uav_disasters_emergency = [591, 385, 577, 120, 402, 292, 444, 285, 178, 497, 100, 369, 176, 211, 373, 355, 572, 685, 201, 453]
# uav_water_pollution = [285, 176, 258, 116, 603, 111, 143, 595, 226, 455, 485, 92, 457, 88, 100, 360, 172, 371, 358, 639]
# uav_air_pollution = [455, 285, 176, 448, 100, 258, 485, 402, 603, 111, 577, 358, 661, 286, 453, 678, 629, 497, 489, 572]
# uav_household_waste = [661, 285, 455, 100, 620, 176, 676, 448, 402, 348, 642, 595, 178, 371, 123, 387, 286, 66, 258, 201]
# uav_infrared_thermal = [285, 176, 100, 603, 116, 642, 342, 172, 358, 143, 54, 508, 457, 201, 286, 371, 485, 88, 111, 452]
# uav_agriculture_mapping = [245, 201, 285, 497, 286, 514, 666, 639, 371, 373, 100, 120, 453, 348, 336, 176, 480, 572, 402, 54]

In [None]:
#@title Unique topics
uav_disasters_emergency = [591, 385, 577, 120, 402, 292, 444, 285, 178, 497, 100, 369, 176, 211, 373, 355, 572, 685, 201, 453]
uav_water_pollution = [285, 176, 258, 116, 603, 111, 143, 595, 226, 455, 485, 92, 457, 88, 100, 360, 172, 371, 358, 639]
uav_air_pollution = [455, 285, 176, 448, 100, 258, 485, 402, 603, 111, 577, 358, 661, 286, 453, 678, 629, 497, 489, 572]
uav_household_waste = [661, 285, 455, 100, 620, 176, 676, 448, 402, 348, 642, 595, 178, 371, 123, 387, 286, 66, 258, 201]
uav_infrared_thermal = [285, 176, 100, 603, 116, 642, 342, 172, 358, 143, 54, 508, 457, 201, 286, 371, 485, 88, 111, 452]
uav_agriculture_mapping = [245, 201, 285, 497, 286, 514, 666, 639, 371, 373, 100, 120, 453, 348, 336, 176, 480, 572, 402, 54]
# Combining all lists into one for further analysis
all_topics = uav_disasters_emergency + uav_air_pollution + uav_water_pollution + uav_household_waste + uav_infrared_thermal + uav_agriculture_mapping

# Creating a list of unique topics for each group
unique_disasters_emergency = [topic for topic in uav_disasters_emergency if all_topics.count(topic) == 1]
unique_air_pollution = [topic for topic in uav_air_pollution if all_topics.count(topic) == 1]
unique_water_pollution = [topic for topic in uav_water_pollution if all_topics.count(topic) == 1]
unique_household_waste = [topic for topic in uav_household_waste if all_topics.count(topic) == 1]
unique_infrared_thermal = [topic for topic in uav_infrared_thermal if all_topics.count(topic) == 1]
unique_agriculture_mapping = [topic for topic in uav_agriculture_mapping if all_topics.count(topic) == 1]

# Printing unique lists of topics
print("Unique topics")
print("Disasters & Emergency Topics:", unique_disasters_emergency)
print("Air Pollution Topics:", unique_air_pollution)
print("Water Pollution Topics:", unique_water_pollution)
print("Household Waste Topics:", unique_household_waste)
print("Infrared & Thermal Topics:", unique_infrared_thermal)
print("Agriculture Mapping Topics:", unique_agriculture_mapping)

Unique topics
Disasters & Emergency Topics: [591, 385, 292, 444, 369, 211, 355, 685]
Air Pollution Topics: [678, 629, 489]
Water Pollution Topics: [226, 92, 360]
Household Waste Topics: [620, 676, 123, 387, 66]
Infrared & Thermal Topics: [342, 508, 452]
Agriculture Mapping Topics: [245, 514, 666, 336, 480]
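
Calling `list.count` inside a comprehension rescans the combined list for every topic. A minimal sketch of the same idea with `collections.Counter`, which tallies everything in one pass and yields both the unique and the shared topics, is given below. The `groups` dictionary is shortened, hypothetical data, and each topic is counted once per group, which matches the cell above as long as no list contains duplicates.

```python
from collections import Counter

# Shortened, hypothetical topic lists keyed by group name.
groups = {
    "disasters_emergency": [285, 176, 591],
    "water_pollution": [285, 176, 226],
    "agriculture_mapping": [285, 245],
}

# Count in how many groups each topic appears.
group_counts = Counter(t for topics in groups.values() for t in set(topics))

# Topics that belong to exactly one group, keeping each list's original order.
unique_topics = {
    name: [t for t in topics if group_counts[t] == 1]
    for name, topics in groups.items()
}

# Topics shared by two or more groups.
shared_topics = sorted(t for t, c in group_counts.items() if c > 1)

print(unique_topics)
print(shared_topics)
```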


In [None]:
#@title Intersecting topics
#TODO find a way to use this information
# Collecting all lists into a dictionary for easier processing
topics_groups_uav = {
    "disasters_emergency": uav_disasters_emergency,
    "air_pollution": uav_air_pollution,
    "water_pollution": uav_water_pollution,
    "household_waste": uav_household_waste,
    "infrared_thermal": uav_infrared_thermal,
    "agriculture_mapping": uav_agriculture_mapping
}

# Creating a list of all topics
all_topics = sum(topics_groups_uav.values(), [])

# Determining topics that occur more than once (intersect between groups)
intersecting_topics = set(topic for topic in all_topics if all_topics.count(topic) > 1)

print("Intersecting topics:", intersecting_topics)

Intersecting topics: {258, 642, 143, 402, 661, 285, 286, 172, 176, 178, 54, 572, 448, 577, 453, 455, 201, 457, 595, 88, 603, 348, 100, 485, 358, 111, 497, 371, 116, 373, 120, 639}


The code below assigns each document to a topic group, keeps only the documents whose topic number belongs to one of the groups and whose creation date falls between 2010-01-01 and 2023-12-31, and adds a `year` column. The same steps could be wrapped in a function such as `assign_and_filter_topics` that accepts a DataFrame, a `topic_groups` dictionary, and optional `start_date` and `end_date` arguments, which would make them reusable for other topic groups and date ranges.

In [None]:
#@title Assigning Topic Groups to Filtered DataFrame

# Define topic groups
topic_groups = {
    'Disasters & Emergency': [591, 385, 292, 444, 369, 211, 355, 685],
    'Air Pollution': [678, 629, 489],
    'Water Pollution': [226, 92, 360],
    'Household Waste': [620, 676, 123, 387, 66],
    'Infrared & Thermal': [342, 508, 452],
    'Agriculture Mapping': [245, 514, 666, 336, 480]
}

# Function to assign topic group based on topic number
def assign_topic_group(topic_num):
    for group_name, topics in topic_groups.items():
        if topic_num in topics:
            return group_name
    return 'Other'

# Convert 'created' column to datetime format
df['created'] = pd.to_datetime(df['created'])

# Combine all topics into a single list for filtering
all_topics = sum(topic_groups.values(), [])

# Filter the DataFrame based on topic numbers and date range, and create a copy
df_filtered = df[(df['topic_num'].isin(all_topics)) &
                 (df['created'] >= '2010-01-01') &
                 (df['created'] <= '2023-12-31')].copy()

# Assign topic groups and extract the year from the 'created' column
df_filtered.loc[:, 'topic_group'] = df_filtered['topic_num'].apply(assign_topic_group)
df_filtered.loc[:, 'year'] = df_filtered['created'].dt.year

# Display the first few rows of the filtered DataFrame
df_filtered.head()

Unnamed: 0,title,id,abstract,categories,doi,created,updated,authors,url,abstract_length,id_n,topic_num,topic_representation,topic_group,year
712,pseudorandomness in central force optimization,1001.0317,central force optimization is a deterministic ...,cs.oh,,2010-01-02,2010-02-03,['formato'],https://arxiv.org/abs/1001.0317,1089,712,66,"bayesian, kalman, gaussian, filtering, laplacian",Household Waste,2010
772,construction of wiretap codes from ordinary ch...,1001.1197,from an arbitrary given channel code over a di...,cs.it cs.cr math.it,10.1109/isit.2010.5513794,2010-01-08,NaT,"['hayashi', 'matsumoto']",https://arxiv.org/abs/1001.1197,335,772,123,"cryptography, rssi, transceiver, ciphers, deco...",Household Waste,2010
834,a new method to extract dorsal hand vein patte...,1001.1966,"among all biometric, dorsal hand vein pattern ...",cs.cv cs.cr,,2010-01-12,NaT,"['khan', 'khan']",https://arxiv.org/abs/1001.1966,656,834,360,"fingerprint, fingerprinting, fingerprints, bio...",Water Pollution,2010
855,message detection and extraction of chaotic op...,1001.206,the security of chaotic optical communication ...,cs.cr,,2010-01-12,2010-03-18,"['zhao', 'yin']",https://arxiv.org/abs/1001.2060,639,855,123,"cryptography, rssi, transceiver, ciphers, deco...",Household Waste,2010
869,towards a generic framework to generate explan...,1001.2188,"in this report, we show how to use the simple ...",cs.pl,,2010-01-13,NaT,"['deransart', 'oliveira']",https://arxiv.org/abs/1001.2188,752,869,92,"tracking, tracker, tracked, triangulation, com...",Water Pollution,2010
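
The assignment and filtering steps above can be wrapped in a reusable function. The following is a minimal sketch, assuming the name `assign_and_filter_topics` and a DataFrame with `topic_num` and `created` columns; the toy data is hypothetical, and this is one possible shape for such a helper, not the notebook's own implementation.

```python
import pandas as pd

def assign_and_filter_topics(df, topic_groups, start_date=None, end_date=None):
    """Keep rows whose topic is in topic_groups (and in the date range), and label them."""
    # Invert {group: [topics]} into {topic: group}; if a topic is listed twice, the last group wins.
    topic_to_group = {t: g for g, topics in topic_groups.items() for t in topics}

    created = pd.to_datetime(df["created"])
    mask = df["topic_num"].isin(list(topic_to_group))
    if start_date is not None:
        mask &= (created >= pd.Timestamp(start_date))
    if end_date is not None:
        mask &= (created <= pd.Timestamp(end_date))

    out = df.loc[mask].copy()          # copy so the caller's DataFrame is untouched
    out["created"] = created[mask]
    out["topic_group"] = out["topic_num"].map(topic_to_group)
    out["year"] = out["created"].dt.year
    return out

# Hypothetical toy data for illustration.
toy = pd.DataFrame({
    "topic_num": [245, 66, 999, 245],
    "created": ["2009-05-01", "2015-03-10", "2016-01-01", "2020-07-04"],
})
toy_groups = {"Agriculture Mapping": [245], "Household Waste": [66]}

result = assign_and_filter_topics(toy, toy_groups,
                                  start_date="2010-01-01", end_date="2023-12-31")
print(result)
```

Mapping through an inverted dictionary avoids looping over every group for every row, and leaving both dates as `None` skips date filtering entirely.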


### Analysis of the dynamics of thematic groups


In the code below, we analyze the growth and decline in interest in topics over time. We calculate the yearly change in the number of publications for each topic, the acceleration of growth, and the relative growth.

We then identify the top 5 topics with the greatest increase and decrease in interest, as well as low-volume but fast-growing topic groups.

The results are stored in a dictionary for further analysis and visualizations.

In [None]:
#@title Analyzing Topic Growth and Decline

# Count the number of publications for each topic by year
publications_per_topic_year = df_filtered.groupby(['topic_num', 'year']).size().unstack(fill_value=0)

# Calculate the yearly change in the number of publications for each topic (growth rate)
growth_per_topic = publications_per_topic_year.diff(axis=1)

# Calculate the acceleration of growth (change in growth rate)
acceleration_per_topic = growth_per_topic.diff(axis=1)

# Calculate the relative growth for each year (shift along the year axis so each change is divided by the previous year's count)
relative_growth_per_topic = growth_per_topic / publications_per_topic_year.shift(1, axis=1)

# Sum the yearly changes to get each topic's net change over the whole period (last-year count minus first-year count)
total_growth_per_topic = growth_per_topic.sum(axis=1)

# Topics with the greatest increase in interest
top_growing_topics = total_growth_per_topic.nlargest(5)

# Topics with the smallest net change (lowest growth, not necessarily a decline)
top_declining_topics = total_growth_per_topic.nsmallest(5)

# Identify low-volume but fast-growing topics
volume_threshold = 50
low_volume_topics = publications_per_topic_year[publications_per_topic_year.sum(axis=1) < volume_threshold].index
low_volume_fast_growing_topics = total_growth_per_topic[low_volume_topics].nlargest(5)

# Store the results in a dictionary or DataFrame for further analysis or visualization
analysis_results = {
    'top_growing_topics': top_growing_topics,
    'top_declining_topics': top_declining_topics,
    'low_volume_fast_growing_topics': low_volume_fast_growing_topics,
    # Add other metrics here as needed
}

# Display the results
print("Top 5 topics with the greatest increase in interest:")
print(top_growing_topics)
print("\nTop 5 topics with the greatest decrease in interest:")
print(top_declining_topics)
print("\nLow-volume, fast-growing topic groups:")
print(low_volume_fast_growing_topics)

Top 5 topics with the greatest increase in interest:
topic_num
92     179.0
66     157.0
211    143.0
245    137.0
226    130.0
dtype: float64

Top 5 topics with the greatest decrease in interest:
topic_num
452   -10.0
685     1.0
676     4.0
620     8.0
666     8.0
dtype: float64

Low-volume, fast-growing topic groups:
topic_num
678    9.0
676    4.0
685    1.0
dtype: float64
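
A small worked example may help in reading these metrics. The sketch below uses a hypothetical two-topic table with years as columns, the same layout as `publications_per_topic_year`, and computes the yearly change, acceleration, relative growth, and net change. Two points are worth noting: `shift` needs `axis=1` to move values across years instead of across topics, and a previous-year count of zero produces an infinite ratio, which is replaced with NaN here as one possible convention.

```python
import numpy as np
import pandas as pd

# Hypothetical publication counts: rows are topics, columns are years.
counts = pd.DataFrame(
    {2021: [10, 0], 2022: [15, 4], 2023: [12, 6]},
    index=pd.Index([92, 678], name="topic_num"),
)

growth = counts.diff(axis=1)              # yearly change in publications
acceleration = growth.diff(axis=1)        # change in the yearly change
previous = counts.shift(1, axis=1)        # previous year's count for each cell
relative = (growth / previous).replace([np.inf, -np.inf], np.nan)
total_growth = growth.sum(axis=1)         # equals last-year count minus first-year count

print(growth)
print(relative)
print(total_growth)
```

Because the yearly changes telescope, the summed growth depends only on the first and last years, so it can hide large swings in between.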


In [None]:
# Create a dictionary to store the counts and groups
topic_info = {}

# Loop through each row in the dataframe
for index, row in df_filtered.iterrows():
    # Get the topic_num and topic_group values
    topic_num = row["topic_num"]
    topic_group = row["topic_group"]

    # If the topic_num is not already in the dictionary, add it with a count of 1 and the topic_group
    if topic_num not in topic_info:
        topic_info[topic_num] = {"count": 1, "group": topic_group}
    # Otherwise, increment the count for that topic_num
    else:
        topic_info[topic_num]["count"] += 1

# Print the topic_num counts and groups
for topic_num, info in topic_info.items():
    print(f"Topic num: {topic_num}, Count: {info['count']}, Group: {info['group']}")


Topic num: 66, Count: 1835, Group: Household Waste
Topic num: 123, Count: 1065, Group: Household Waste
Topic num: 360, Count: 334, Group: Water Pollution
Topic num: 92, Count: 1378, Group: Water Pollution
Topic num: 452, Count: 216, Group: Infrared & Thermal
Topic num: 369, Count: 320, Group: Disasters & Emergency
Topic num: 292, Count: 458, Group: Disasters & Emergency
Topic num: 245, Count: 571, Group: Agriculture Mapping
Topic num: 226, Count: 616, Group: Water Pollution
Topic num: 342, Count: 367, Group: Infrared & Thermal
Topic num: 591, Count: 117, Group: Disasters & Emergency
Topic num: 385, Count: 300, Group: Disasters & Emergency
Topic num: 387, Count: 298, Group: Household Waste
Topic num: 336, Count: 381, Group: Agriculture Mapping
Topic num: 678, Count: 42, Group: Air Pollution
Topic num: 508, Count: 176, Group: Infrared & Thermal
Topic num: 676, Count: 42, Group: Household Waste
Topic num: 489, Count: 186, Group: Air Pollution
Topic num: 211, Count: 643, Group: Disasters &

In [None]:
df_filtered.shape

(10556, 15)

In [None]:
from collections import Counter
import pandas as pd

# Selecting a topic for analysis
topic_num_to_analyze = 92

# Filtering the DataFrame for the selected topic
topic_info = df_filtered[df_filtered['topic_num'] == topic_num_to_analyze]

# Displaying aggregated information about the selected topic
print(f"Insights for Topic #{topic_num_to_analyze}:")
print("-" * 50)

# Top Keywords
top_keywords = Counter(" ".join(topic_info['abstract']).split()).most_common(10)
print("Top Keywords:")
for keyword, count in top_keywords:
    print(f"{keyword}: {count}")

# Year Distribution
year_distribution = topic_info['year'].value_counts().sort_index()
print("\nYear Distribution:")
print(year_distribution)

# Top Authors (assuming an 'authors' column exists)
top_authors = Counter(", ".join(topic_info['authors']).split(", ")).most_common(5)
print("\nTop Authors:")
for author, count in top_authors:
    print(f"{author}: {count}")

# Summary Statistics
print("\nSummary Statistics:")
print(f"Total Papers: {len(topic_info)}")
print(f"Publication Years: {topic_info['year'].min()} - {topic_info['year'].max()}")

print("-" * 50)


Insights for Topic #92:
--------------------------------------------------
Top Keywords:
the: 11147
and: 6288
of: 5793
a: 5170
to: 5169
in: 4485
tracking: 3317
we: 2814
is: 2569
for: 2336

Year Distribution:
2010      5
2011      8
2012     22
2013     14
2014     19
2015     44
2016     53
2017    110
2018    119
2019    169
2020    212
2021    212
2022    207
2023    184
Name: year, dtype: int64

Top Authors:
'wang': 158
'li': 134
'zhang': 122
'liu': 80
'yang': 66

Summary Statistics:
Total Papers: 1378
Publication Years: 2010 - 2023
--------------------------------------------------
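
The keyword list above is dominated by function words such as "the" and "and", and the author names keep their quotation marks. A hedged sketch of two small refinements follows: filtering a stopword set before counting, and parsing each author entry with `ast.literal_eval`. It assumes the `authors` column holds stringified Python lists, as the quoted names in the output suggest; the stopword set and sample data are illustrative and hypothetical.

```python
import ast
from collections import Counter

# Small illustrative stopword set; a real analysis would use a fuller list.
STOPWORDS = {"the", "and", "of", "a", "to", "in", "we", "is", "for"}

# Hypothetical sample data mimicking the 'abstract' and 'authors' columns.
abstracts = [
    "we propose a tracking method for the uav",
    "tracking of objects in video is hard",
]
authors_raw = ["['wang', 'li']", "['li', 'zhang']"]

def top_keywords(texts, n):
    """Most common whitespace-separated words, ignoring stopwords."""
    words = (w for text in texts for w in text.lower().split())
    return Counter(w for w in words if w not in STOPWORDS).most_common(n)

def top_authors(raw_lists, n):
    """Most common authors, parsing each stringified list into real names."""
    names = (name for raw in raw_lists for name in ast.literal_eval(raw))
    return Counter(names).most_common(n)

print(top_keywords(abstracts, 3))
print(top_authors(authors_raw, 3))
```

`ast.literal_eval` only evaluates Python literals, so it is a safer choice than `eval` for parsing stored list strings.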


In [None]:
#@title Analyzing a specific thematic group
from tabulate import tabulate

group_name_to_analyze = "Disasters & Emergency"  # Replace with the name of the group you are interested in

# Get the topic numbers for the selected thematic group
topic_nums = topic_groups[group_name_to_analyze]

# Filter data for the selected group
group_data = df_filtered[df_filtered['topic_num'].isin(topic_nums)]

# Aggregate data by year for the selected group
group_publications_per_year = group_data.groupby('year')['topic_num'].count()
group_growth = group_publications_per_year.diff().fillna(0)
group_relative_growth = (group_growth / group_publications_per_year.shift(1) * 100).fillna(0)

# Create a DataFrame with the dynamics analysis for the selected group
group_analysis = pd.DataFrame({
    'Year': group_publications_per_year.index,
    'Number of Publications': group_publications_per_year.values,
    'Growth Acceleration': group_growth.diff().fillna(0).round(2),  # Adding growth acceleration
    'Change in Number of Publications': group_growth.values,
    'Relative Growth': group_relative_growth.round(2).astype(str) + '%'  # Rounding and formatting relative growth
}).set_index('Year')

# Display the analysis for the selected thematic group
print(f"Analysis for Thematic Group: {group_name_to_analyze}")
print(tabulate(group_analysis, headers='keys', tablefmt='pipe', showindex=True))


Analysis for Thematic Group: Disasters & Emergency
|   Year |   Number of Publications |   Growth Acceleration |   Change in Number of Publications | Relative Growth   |
|-------:|-------------------------:|----------------------:|-----------------------------------:|:------------------|
|   2010 |                       19 |                     0 |                                  0 | 0.0%              |
|   2011 |                       15 |                    -4 |                                 -4 | -21.05%           |
|   2012 |                       28 |                    17 |                                 13 | 86.67%            |
|   2013 |                       38 |                    -3 |                                 10 | 35.71%            |
|   2014 |                       28 |                   -20 |                                -10 | -26.32%           |
|   2015 |                       47 |                    29 |                                 19 | 67.86%           

In [None]:
# Get unique values of the "topic_group" column
unique_values = df_filtered["topic_group"].unique()

print(unique_values)

['Household Waste' 'Water Pollution' 'Infrared & Thermal'
 'Disasters & Emergency' 'Agriculture Mapping' 'Air Pollution']


In [None]:
# Selecting the thematic group for analysis
group_to_analyze = "Water Pollution"

# Extracting data for the selected thematic group
group_data = df_filtered[df_filtered['topic_group'] == group_to_analyze]

# Calculating the number of publications per year for the selected group
group_publications_per_year = group_data.groupby('year').size()

# Calculating the total number of publications and the total growth over the entire period
total_publications = group_publications_per_year.sum()
total_growth = group_publications_per_year.diff().sum()

# Calculating the average annual growth
average_annual_growth = total_growth / (group_publications_per_year.index.max() - group_publications_per_year.index.min())

# Printing the summary statistics
print(f"Thematic Group: {group_to_analyze}")
print(f"Total Publications: {total_publications}")
print(f"Total Growth: {total_growth}")
print(f"Average Annual Growth: {average_annual_growth:.2f}")


Thematic Group: Water Pollution
Total Publications: 2328
Total Growth: 339.0
Average Annual Growth: 26.08
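
The "Total Growth" above is the sum of year-over-year differences, which telescopes to the last year's count minus the first year's count; the average annual growth then divides that by the number of elapsed years (here 339 / 13 ≈ 26.08). A minimal sketch of the same arithmetic, using the six yearly counts from the per-topic table above purely as sample numbers:

```python
# Sample yearly counts (2010-2015 rows of the per-topic table above); illustration only.
yearly_counts = {2010: 19, 2011: 15, 2012: 28, 2013: 38, 2014: 28, 2015: 47}

years = sorted(yearly_counts)

# Year-over-year differences, the plain-Python analogue of Series.diff()
diffs = [yearly_counts[b] - yearly_counts[a] for a, b in zip(years, years[1:])]

# The sum of the differences telescopes to "last minus first"
total_growth = sum(diffs)
last_minus_first = yearly_counts[years[-1]] - yearly_counts[years[0]]

# Average growth per elapsed year (number of intervals, not number of data points)
average_annual_growth = total_growth / (years[-1] - years[0])

print(f"Differences: {diffs}")
print(f"Total growth: {total_growth} (last - first = {last_minus_first})")
print(f"Average annual growth: {average_annual_growth:.2f}")
```

Because only the first and last years enter the result, this statistic ignores everything that happens in between; a fitted trend line is less sensitive to unusual endpoint years.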


In [None]:
#@title Visualization of the topic trend analysis
import plotly.graph_objects as go
import numpy as np

def visualize_topic_trend_plotly(topic_num, topic_group):
    # Extract data for the specific topic
    publications = publications_per_topic_year.loc[topic_num]
    changes = growth_per_topic.loc[topic_num]
    relative_growth = relative_growth_per_topic.loc[topic_num] * 100  # Convert to percentage

    # Replace NaN values with zeros
    relative_growth = relative_growth.fillna(0)

    years = publications.index

    # Create a plot for the number of publications
    fig = go.Figure()
    fig.add_trace(go.Scatter(x=years, y=publications, mode='lines+markers', name='Number of Publications', marker=dict(size=8, color='blue')))

    # Create a plot for the change in the number of publications
    fig.add_trace(go.Bar(x=years, y=changes, name='Change in Number of Publications', marker_color='orange', opacity=0.6))

    # Create a plot for the relative growth
    fig.add_trace(go.Scatter(x=years, y=relative_growth, mode='lines', name='Relative Growth (%)', yaxis='y2', line=dict(color='green', width=2, dash='dash')))

    # Customize the layout
    fig.update_layout(
        title=f'Trend Analysis for Topic {topic_num} ({topic_group})',
        xaxis_title='Year',
        yaxis_title='Number of Publications',
        yaxis2=dict(title='Relative Growth (%)', overlaying='y', side='right', range=[-100, 100]),
        legend=dict(x=1.05, y=1, traceorder='reversed', font_size=16),
        barmode='overlay',
        template='plotly_white'
    )

    # Show the plot
    fig.show()

# Example usage
topic_num_to_analyze = 92
topic_group = df_filtered[df_filtered['topic_num'] == topic_num_to_analyze]['topic_group'].iloc[0]
print(f"Trend Analysis for Topic #{topic_num_to_analyze} ({topic_group}):")
visualize_topic_trend_plotly(topic_num_to_analyze, topic_group)


Trend Analysis for Topic #92 (Water Pollution):


In [None]:
#@title Visualization of the thematic group trend analysis
# Named differently from the per-topic function above so that it does not overwrite it
def visualize_group_trend_plotly(group_name, df_filtered, topic_groups):
    # Filter the dataframe for the selected topic group
    topic_nums = topic_groups[group_name]
    df_group = df_filtered[df_filtered['topic_num'].isin(topic_nums)]

    # Group by year and count publications
    publications_per_year = df_group.groupby('year')['topic_num'].count()

    # Calculate changes and relative growth
    changes = publications_per_year.diff().fillna(0)
    relative_growth = (changes / publications_per_year.shift(1) * 100).fillna(0)

    years = publications_per_year.index

    # Create a plot for the number of publications
    fig = go.Figure()
    fig.add_trace(go.Scatter(x=years, y=publications_per_year, mode='lines+markers', name='Number of Publications', marker=dict(size=8, color='blue')))

    # Create a plot for the change in the number of publications
    fig.add_trace(go.Bar(x=years, y=changes, name='Change in Number of Publications', marker_color='orange', opacity=0.6))

    # Create a plot for the relative growth
    fig.add_trace(go.Scatter(x=years, y=relative_growth, mode='lines', name='Relative Growth (%)', yaxis='y2', line=dict(color='green', width=2, dash='dash')))

    # Customize the layout
    fig.update_layout(
        title=f'Trend Analysis for {group_name}',
        xaxis_title='Year',
        yaxis_title='Number of Publications',
        yaxis2=dict(title='Relative Growth (%)', overlaying='y', side='right'),
        legend=dict(x=1.05, y=1, traceorder='reversed', font_size=16),
        barmode='overlay',
        template='plotly_white'
    )

    # Show the plot
    fig.show()

# Example usage
group_name = 'Household Waste'  # Replace with the actual name of the group
visualize_group_trend_plotly(group_name, df_filtered, topic_groups)
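
One edge case in the relative-growth formula: if a year has zero publications, the following year's ratio is a division by zero, which pandas reports as `inf` rather than `NaN`, so `fillna(0)` alone does not catch it. A minimal sketch with made-up counts, assuming that undefined growth should be shown as 0 (a presentation choice, not the only reasonable one):

```python
import numpy as np
import pandas as pd

# Made-up yearly counts with an empty first year (illustration only)
counts = pd.Series([0, 4, 6, 3], index=[2010, 2011, 2012, 2013], name="publications")

# Same formula as in the plotting cells
changes = counts.diff().fillna(0)
raw_relative_growth = changes / counts.shift(1) * 100  # 2011 divides by zero -> inf

# Map +/-inf to NaN first, then treat every undefined value as 0
relative_growth = raw_relative_growth.replace([np.inf, -np.inf], np.nan).fillna(0)

print(pd.DataFrame({"raw": raw_relative_growth, "cleaned": relative_growth}))
```

Without this step a single `inf` value can distort the autoscaled secondary axis of the plot.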


#### Plots of median? (TODO)

### Plots

In [None]:
from sklearn.linear_model import LinearRegression
import plotly.express as px

# Grouping by month and year of creation and thematic group
df_grouped = df_filtered.groupby([df_filtered['created'].dt.to_period("M"), 'topic_group']).size().reset_index(name='counts')
df_grouped['created'] = df_grouped['created'].dt.to_timestamp()

# For convenience, add a column with the number of months since the start of 2010
df_grouped['months_since_start'] = (df_grouped['created'].dt.year - 2010) * 12 + df_grouped['created'].dt.month - 1


In [None]:
#@title Polynomial Trends
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Prepare colors for each thematic group
colors = {
    'Disasters & Emergency': 'rgba(255, 99, 132, 1)',  # Red
    'Air Pollution': 'rgba(54, 162, 235, 1)',  # Blue
    'Water Pollution': 'rgba(255, 206, 86, 1)',  # Yellow
    'Household Waste': 'rgba(75, 192, 192, 1)',  # Teal
    'Infrared & Thermal': 'rgba(153, 102, 255, 1)',  # Purple
    'Agriculture Mapping': 'rgba(10, 0, 99, 1)'  # Dark blue (alpha must be between 0 and 1)
}

# Initialize the start year
start_year = 2010

fig = go.Figure()

# Convert 'months_since_start' to 'year-month' for tooltips
def convert_to_year_month(months_since_start):
    year = start_year + months_since_start // 12
    month = months_since_start % 12 + 1  # +1 because counting starts from 0
    return f"{year}-{month:02d}"

# Add data and trends for each group
for group_name in topic_groups.keys():
    df_filtered_group = df_grouped[df_grouped['topic_group'] == group_name]
    if not df_filtered_group.empty:
        X = df_filtered_group['months_since_start'].values.reshape(-1, 1)
        y = df_filtered_group['counts']
        poly = PolynomialFeatures(degree=3)
        X_poly = poly.fit_transform(X)
        model = LinearRegression()
        model.fit(X_poly, y)
        X_pred = np.linspace(X.min(), X.max(), 100).reshape(-1, 1)
        X_pred_poly = poly.transform(X_pred)
        y_pred = model.predict(X_pred_poly)

        tooltips = [f"{convert_to_year_month(m)}, Publications: {c}" for m, c in zip(df_filtered_group['months_since_start'], df_filtered_group['counts'])]

        # Determine color for the current group
        color = colors.get(group_name, 'rgba(0, 0, 0, 1)')  # Black by default

        fig.add_trace(go.Scatter(
            x=df_filtered_group['months_since_start'],
            y=df_filtered_group['counts'],
            mode='markers',
            name=f'Real Data {group_name}',
            text=tooltips,
            hoverinfo='text',
            marker=dict(color=color, opacity=0.5)  # Point transparency
        ))

        fig.add_trace(go.Scatter(
            x=X_pred.flatten(),
            y=y_pred,
            mode='lines',
            name=f'Trend {group_name}',
            line=dict(color=color)  # Line color
        ))

# Plot settings
fig.update_layout(
    title='Polynomial Regression for All Thematic Groups',
    xaxis_title='Year',
    yaxis_title='Number of Publications',
    legend_title='Thematic Group',
    width=1200, height=600
)

# Update the X-axis to display years
max_months_since_start = df_grouped['months_since_start'].max()
ticks_vals = np.arange(0, max_months_since_start + 1, 12)  # Every 12 months
ticks_text = [str(start_year + int(months / 12)) for months in ticks_vals]
fig.update_xaxes(tickvals=ticks_vals, ticktext=ticks_text)

# Show the plot
fig.show()
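
The trend lines above come from a degree-3 polynomial fitted to monthly counts. The same kind of fit can be sketched with NumPy alone; the synthetic data below is invented for the illustration, and `Polynomial.fit` is used in place of the scikit-learn pipeline because it rescales the x-axis internally, which keeps the fit well conditioned when month indices reach into the hundreds:

```python
import numpy as np
from numpy.polynomial import Polynomial

# Synthetic monthly counts following an exact cubic (invented for this sketch)
months = np.arange(0, 168)  # 14 years of monthly points
counts = 5 + 0.2 * months - 0.003 * months**2 + 0.00002 * months**3

# Least-squares fit of a degree-3 polynomial; x is mapped to [-1, 1] internally
trend = Polynomial.fit(months, counts, deg=3)

# The fitted object is evaluated on the original (unscaled) month values
fitted = trend(months)
max_error = float(np.max(np.abs(fitted - counts)))

print(f"Largest absolute fit error: {max_error:.2e}")
```

On real, noisy counts the error will of course not be near zero; a cubic is also free to bend sharply at the ends of the range, so the first and last months of the trend line should be read with caution.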


In [None]:
df['year'] = df['created'].dt.year
total_publications_per_year = df.groupby('year').size()

yearly_change = total_publications_per_year.diff().dropna()
average_yearly_change = yearly_change.mean()
print(f"Average yearly change in the number of publications: {average_yearly_change}")

# Initialize a dictionary to store the average change for each topic
topic_yearly_change = {}

# Iterate over unique topics
for topic_num in df['topic_num'].unique():
    # Filter the dataframe by the current topic
    topic_df = df[df['topic_num'] == topic_num]

    # Count publications per year for the current topic
    publications_per_year = topic_df.groupby('year').size()

    # Calculate the change in the number of publications per year and its mean
    change = publications_per_year.diff().dropna()
    topic_yearly_change[topic_num] = change.mean()

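# Determine topics whose average change exceeds the overall average.
# Note: the overall average is computed on the corpus-wide yearly totals, so a single
# topic rarely exceeds it; in the run shown below the resulting list is empty.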
topics_above_average = {topic: change for topic, change in topic_yearly_change.items() if change > average_yearly_change}

print("Topics with yearly change above the overall average:")
for topic, change in topics_above_average.items():
    print(f"Topic #{topic} - Average yearly change: {change}")


Average yearly change in the number of publications: 3175.9333333333334
Topics with yearly change above the overall average:
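
The list above is empty, and the arithmetic explains why: when every topic covers the same years, the average yearly change of the total equals the sum of the per-topic average changes, so one topic can only exceed it if the others shrink on balance. A small sketch with three hypothetical topics; comparing against the mean of the per-topic averages is one possible alternative baseline, an assumption rather than the notebook's method:

```python
from statistics import mean

# Hypothetical yearly counts for three topics over the same four years
topic_counts = {
    "A": [10, 14, 20, 30],
    "B": [50, 52, 51, 56],
    "C": [5, 5, 6, 8],
}

def average_change(counts):
    """Mean of the year-over-year differences."""
    return mean(b - a for a, b in zip(counts, counts[1:]))

per_topic = {topic: average_change(c) for topic, c in topic_counts.items()}

# Corpus-wide totals per year, then their average change
totals = [sum(year_values) for year_values in zip(*topic_counts.values())]
overall = average_change(totals)

# Baseline used in the cell above: the corpus-wide average change
above_total = [t for t, change in per_topic.items() if change > overall]

# Alternative baseline (assumption): the typical per-topic average change
typical = mean(per_topic.values())
above_typical = [t for t, change in per_topic.items() if change > typical]

print(f"Overall: {overall:.2f}, sum of per-topic: {sum(per_topic.values()):.2f}")
print(f"Above corpus-wide average: {above_total}")
print(f"Above typical topic average: {above_typical}")
```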


In [None]:
# Create a full list of all topics
all_topics = [topic for topics in topic_groups.values() for topic in topics]

# Filter df_filtered to include only topics from all_topics
df_relevant = df_filtered[df_filtered['topic_num'].isin(all_topics)]

# Count the total number of publications for each thematic group
group_counts = {group: df_relevant[df_relevant['topic_num'].isin(topics)]['topic_num'].count()
                for group, topics in topic_groups.items()}

# Output the total number of publications for each thematic group
for group, count in group_counts.items():
    print(f"Thematic group '{group}' - Total number of publications: {count}")

# Count the total number of publications for all topics
total_posts = df_relevant['topic_num'].count()

# Output the total number of publications for all thematic groups
print(f"Total number of publications for all thematic groups: {total_posts}")


Thematic group 'Disasters & Emergency' - Total number of publications: 2450
Thematic group 'Air Pollution' - Total number of publications: 317
Thematic group 'Water Pollution' - Total number of publications: 2328
Thematic group 'Household Waste' - Total number of publications: 3333
Thematic group 'Infrared & Thermal' - Total number of publications: 759
Thematic group 'Agriculture Mapping' - Total number of publications: 1369
Total number of publications for all thematic groups: 10556
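
To make the group sizes easier to compare, the counts printed above can be turned into shares of the 10,556 relevant publications. A short sketch that copies the printed numbers into a dictionary:

```python
# Counts copied from the output above
group_counts = {
    "Disasters & Emergency": 2450,
    "Air Pollution": 317,
    "Water Pollution": 2328,
    "Household Waste": 3333,
    "Infrared & Thermal": 759,
    "Agriculture Mapping": 1369,
}

total = sum(group_counts.values())

# Percentage share of each thematic group
shares = {group: 100 * count / total for group, count in group_counts.items()}

# Print from largest to smallest group
for group, share in sorted(shares.items(), key=lambda item: item[1], reverse=True):
    print(f"{group}: {share:.1f}%")
print(f"Total: {total}")
```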


In [None]:
import gradio as gr

def get_info(input_value):
    try:
        # Assume input is a topic number
        topic_num = int(input_value)
        html_table, plot = get_topic_analysis(topic_num)
    except ValueError:
        # Input is not a topic number, assume it's a group name
        html_table, plot = get_group_analysis(input_value)
    return html_table, plot

def get_topic_analysis(topic_num):
    topic_group = df_filtered[df_filtered['topic_num'] == topic_num]['topic_group'].iloc[0]
    topic_data = publications_per_topic_year.loc[topic_num]
    topic_growth = growth_per_topic.loc[topic_num]
    topic_relative_growth = relative_growth_per_topic.loc[topic_num] * 100
    topic_growth_acceleration = topic_growth.diff().fillna(0)  # Calculate growth acceleration

    topic_analysis = pd.DataFrame({
        'Year': topic_data.index,
        'Number of Publications': topic_data.values,
        'Change in Number of Publications': topic_growth.values,
        'Growth Acceleration': topic_growth_acceleration.values,  # Add growth acceleration
        'Relative Growth': topic_relative_growth.values
    }).set_index('Year')

    topic_analysis = topic_analysis.reset_index()
    topic_analysis = topic_analysis.round(2)
    html_table = topic_analysis.to_html(classes="table table-striped", justify="left", border=0)

    plot = visualize_trend_analysis(topic_data.index, topic_data.values, topic_growth.values, topic_relative_growth.values, f'Topic {topic_num} ({topic_group})')

    return html_table, plot

def get_group_analysis(group_name):
    topic_nums = topic_groups[group_name]
    df_group = df_filtered[df_filtered['topic_num'].isin(topic_nums)]
    group_publications_per_year = df_group.groupby('year')['topic_num'].count()

    changes = group_publications_per_year.diff().fillna(0)
    relative_growth = (changes / group_publications_per_year.shift(1) * 100).fillna(0)
    growth_acceleration = changes.diff().fillna(0)  # Calculate growth acceleration

    group_analysis = pd.DataFrame({
        'Year': group_publications_per_year.index,
        'Number of Publications': group_publications_per_year.values,
        'Change in Number of Publications': changes.values,
        'Growth Acceleration': growth_acceleration.values,  # Add growth acceleration
        'Relative Growth': relative_growth.values
    }).set_index('Year')

    group_analysis = group_analysis.reset_index()
    group_analysis = group_analysis.round(2)
    html_table = group_analysis.to_html(classes="table table-striped", justify="left", border=0)

    plot = visualize_trend_analysis(group_publications_per_year.index, group_publications_per_year.values, changes.values, relative_growth.values, f'Group: {group_name}')

    return html_table, plot


def visualize_trend_analysis(years, publications, changes, relative_growth, title):
    fig = go.Figure()
    fig.add_trace(go.Scatter(x=years, y=publications, mode='lines+markers', name='Number of Publications', marker=dict(size=8, color='blue')))
    fig.add_trace(go.Bar(x=years, y=changes, name='Change in Number of Publications', marker_color='orange', opacity=0.6))
    fig.add_trace(go.Scatter(x=years, y=relative_growth, mode='lines', name='Relative Growth (%)', yaxis='y2', line=dict(color='green', width=2, dash='dash')))

    fig.update_layout(
        title=f'Trend Analysis for {title}',
        xaxis_title='Year',
        yaxis_title='Number of Publications',
        yaxis2=dict(title='Relative Growth (%)', overlaying='y', side='right', range=[-100, 100]),
        legend=dict(x=1.05, y=1, traceorder='reversed', font_size=16),
        barmode='overlay',
        template='plotly_white'
    )

    return fig

def get_available_topics_and_groups():
    # Creating a list of all topics
    all_topics = [topic for topics in topic_groups.values() for topic in topics]

    # Filtering df_filtered to include only topics from all_topics
    df_relevant = df_filtered[df_filtered['topic_num'].isin(all_topics)]

    # Counting the total number of publications for each thematic group
    group_counts = {group: df_relevant[df_relevant['topic_num'].isin(topics)]['topic_num'].count()
                    for group, topics in topic_groups.items()}

    # Counting the total number of publications for each topic
    topic_counts = df_relevant['topic_num'].value_counts().sort_index()

    # Generating the summary information
    summary = "<b>Available Topics and Thematic Groups:</b><br><br>"
    summary += "<b>Thematic Groups:</b><br>"
    for group, count in group_counts.items():
        summary += f"- {group}: {count} publications<br>"

    summary += "<br><b>Topics:</b><br>"
    for topic, count in topic_counts.items():
        summary += f"- Topic {topic}: {count} publications<br>"

    return summary

# Get the available topics and groups information
available_info = get_available_topics_and_groups()

# Modify the description to include the available topics and groups information
description = """
Enter a topic number or a thematic group name to get information.<br><br>
""" + available_info

iface = gr.Interface(
    fn=get_info,
    inputs=gr.Textbox(label="Topic Number or Thematic Group Name"),
    outputs=[
        gr.HTML(label="Information"),
        gr.Plot(label="Trend Analysis")
    ],
    title="Topic and Thematic Group Analysis",
    description=description
)

iface.launch(debug=True)

Setting queue=True in a Colab notebook requires sharing enabled. Setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. This cell will run indefinitely so that you can see errors and logs. To turn off, set debug=False in launch().
Running on public URL: https://dcb7ea4c0d4b942782.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)


Keyboard interruption in main thread... closing server.
Killing tunnel 127.0.0.1:7860 <> https://dcb7ea4c0d4b942782.gradio.live
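
The interface accepts free text, so an unknown topic number or a misspelled group name would surface as a raw exception. A minimal sketch of input validation that could run before the analysis functions are called; `resolve_query` and its return convention are hypothetical helpers introduced here, not part of the notebook or of Gradio:

```python
def resolve_query(raw_value, known_topics, known_groups):
    """Classify user input as a topic number, a group name, or an error message.

    Hypothetical helper: returns ("topic", int), ("group", str) or ("error", str).
    """
    text = raw_value.strip()

    # First try to read the input as a topic number
    try:
        topic_num = int(text)
    except ValueError:
        topic_num = None

    if topic_num is not None:
        if topic_num in known_topics:
            return ("topic", topic_num)
        return ("error", f"Topic {topic_num} is not among the analysed topics.")

    # Otherwise match group names case-insensitively
    by_lower = {name.lower(): name for name in known_groups}
    match = by_lower.get(text.lower())
    if match is not None:
        return ("group", match)
    return ("error", f"Unknown thematic group: {text!r}")


groups = ["Household Waste", "Water Pollution"]
print(resolve_query(" 92 ", {92, 100}, groups))
print(resolve_query("water pollution", {92, 100}, groups))
print(resolve_query("Noise", {92, 100}, groups))
```

The error message can then be returned as the HTML output, with an empty figure for the plot.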




### Operations with topics and documents

In [None]:
model2 = Top2Vec.load("/content/top2vec_model_arxiv_cs_from2010to2024-01-01")

In [None]:
#@title Search for documents on a specified topic

print("Searching for documents on the topic")
print("=" * 30)

# Performing a search for documents on the specified topic
# In this case, we are looking for 3 documents related to topic number 100
topic_number = 100
number_of_documents = 3
documents, document_scores, document_ids = model2.search_documents_by_topic(topic_num=topic_number, num_docs=number_of_documents)

# Printing the search results
for doc, score, doc_id in zip(documents, document_scores, document_ids):
    print(f"Document ID: {doc_id}, Similarity: {score:.3f}")
    print("-" * 30)
    print(doc)
    print("-" * 30)
    print()

# Variable descriptions:
# documents: a list of documents, ordered from most to least similar to the topic.
# document_scores: semantic similarity scores of documents to the topic, measured as cosine similarity between document and topic vectors.
# document_ids: unique identifiers for the documents. If identifiers were not provided, the index of the document in the original corpus is used.

Searching for documents on the topic
Document ID: 252742, Similarity: 0.880
------------------------------
due to the advantages of flexible deployment and extensive coverage, unmannedaerial vehicles (uavs) have great potential for sensing applications in thenext generation of cellular networks, which will give rise to a cellularinternet of uavs. in this paper, we consider a cellular internet of uavs, wherethe uavs execute sensing tasks through cooperative sensing and transmission tominimize the age of information (aoi). however, the cooperative sensing andtransmission is tightly coupled with the uavs' trajectories, which makes thetrajectory design challenging. to tackle this challenge, we propose adistributed sense-and-send protocol, where the uavs determine the trajectoriesby selecting from a discrete set of tasks and a continuous set of locations forsensing and transmission. based on this protocol, we formulate the trajectorydesign problem for aoi minimization and propose a compound
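
The similarity score reported above is a cosine similarity between a document vector and the topic vector. A tiny worked example with made-up two-dimensional vectors (real embeddings have hundreds of dimensions) shows how the score orders documents:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up vectors for illustration only
topic_vector = [1.0, 0.0]
doc_vectors = {
    "doc_a": [2.0, 0.0],  # same direction as the topic
    "doc_b": [1.0, 1.0],  # 45 degrees away
    "doc_c": [0.0, 3.0],  # orthogonal to the topic
}

scores = {name: cosine_similarity(vec, topic_vector) for name, vec in doc_vectors.items()}

# Most similar documents first
ranked = sorted(scores, key=scores.get, reverse=True)
for name in ranked:
    print(f"{name}: {scores[name]:.3f}")
```

Vector length does not matter, only direction: `doc_a` is twice as long as the topic vector and still gets the maximum score.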

In [None]:
#@title Search across all documents
# Keywords and number of documents to search for
keywords = ["uav", "ecology", "radars"]
num_docs_to_search = 5

print(f"Searching for documents using the keywords: {', '.join(keywords)}")
print("=" * 50)

# Performing the search
documents, document_scores, document_ids = model2.search_documents_by_keywords(keywords=keywords, num_docs=num_docs_to_search)

# Displaying the search results
for doc, score, doc_id in zip(documents, document_scores, document_ids):
    print(f"Document ID: {doc_id}, Similarity: {score:.3f}")
    print("-" * 30)
    print(doc)
    print("-" * 30)
    print()

Searching for documents using the keywords: uav, ecology, radars
Document ID: 516146, Similarity: 0.408
------------------------------
small unmanned aerial vehicles (uavs) are becoming potential threats tosecurity-sensitive areas and personal privacy. a uav can shoot photos atheight, but how to detect such an uninvited intruder is an open problem. thispaper presents mmhawkeye, a passive approach for uav detection with a cotsmillimeter wave (mmwave) radar. mmhawkeye doesn't require prior knowledge ofthe type, motions, and flight trajectory of the uav, while exploiting thesignal feature induced by the uav's periodic micro-motion (pmm) for long-rangeaccurate detection. the design is therefore effective in dealing with low-snrand uncertain reflected signals from the uav. mmhawkeye can further track theuav's position with dynamic programming and particle filtering, and identify itwith a long short-term memory (lstm) based detector. we implement mmhawkeye ona commercial mmwave radar 

In [None]:
#@title Search for similar words
# List of initial keywords for investigation
initial_keywords = ["natural", "disasters", "airquality", "hydrology", "ecology", "monitoring", "radars", "environment"]

# Number of words to search for each keyword
num_similar_words = 20

# Iterating over the list of initial keywords
for keyword in initial_keywords:
    print(f"Searching for words similar to '{keyword}'")
    print("=" * 30)

    # Performing the search for similar words
    words, word_scores = model2.similar_words(keywords=[keyword], keywords_neg=[], num_words=num_similar_words)

    # Displaying only the first 5 found words and their similarity scores
    for word, score in zip(words[:5], word_scores[:5]):  # Limiting the output to the first 5 words
        print(f"Word: {word}, Similarity: {score:.3f}")
    print()

Searching for words similar to 'natural'
Word: unnatural, Similarity: 0.700
Word: naturally, Similarity: 0.664
Word: nature, Similarity: 0.622
Word: natures, Similarity: 0.615
Word: naturalness, Similarity: 0.577

Searching for words similar to 'disasters'
Word: catastrophe, Similarity: 0.789
Word: disaster, Similarity: 0.787
Word: catastrophic, Similarity: 0.629
Word: disastrous, Similarity: 0.618
Word: fault, Similarity: 0.607

Searching for words similar to 'airquality'
Word: pairedwith, Similarity: 0.695
Word: arobust, Similarity: 0.681
Word: thegeneral, Similarity: 0.661
Word: avirtual, Similarity: 0.661
Word: havesimilar, Similarity: 0.658

Searching for words similar to 'hydrology'
Word: geophysics, Similarity: 0.666
Word: epidemiology, Similarity: 0.625
Word: geophysical, Similarity: 0.602
Word: rateless, Similarity: 0.581
Word: percolation, Similarity: 0.576

Searching for words similar to 'ecology'
Word: ecological, Similarity: 0.731
Word: thecomparative, Similarity: 0.687
Wo
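
Several neighbours above ("pairedwith", "arobust", "thecomparative") are two words glued together, and the abstracts printed earlier show the same pattern ("unmannedaerial", "thenext"). One plausible cause, stated here as an assumption because the preprocessing cell is not part of this section, is that line breaks inside abstracts were deleted instead of being replaced with spaces. A minimal sketch of the difference:

```python
# Sample text with hard line breaks, as abstracts are often stored (made-up excerpt)
raw_abstract = (
    "unmanned\naerial vehicles (UAVs) have great potential in the\n"
    "next generation of cellular networks"
)

# Deleting newlines glues the neighbouring words together
glued = raw_abstract.replace("\n", "")

# Splitting on any whitespace and re-joining keeps the words apart
normalized = " ".join(raw_abstract.split())

print(glued)
print(normalized)
```

If this is indeed what happened, normalizing whitespace before training would keep such artificial tokens out of the vocabulary.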

In [None]:
#@title wordcloud
# topic_words, word_scores, topic_scores, topic_nums = model.search_topics(keywords=["emissions","disasters", "monitoring","emergency"], num_topics=2)
# for topic in topic_nums:
#     model.generate_topic_wordcloud(topic)

### Uploading to Hugging Face

In [None]:
!cp -r /content/top2vec_model_arxiv_cs_from2010to2024-01-01 "/content/drive/MyDrive/AI/top2vec"


In [None]:
from huggingface_hub import notebook_login

notebook_login()


VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

In [None]:
from huggingface_hub import HfApi
api = HfApi()
api.upload_file(
    path_or_fileobj="/content/drive/MyDrive/AI/top2vec/top2vec_model_arxiv_cs_from2010to2024-01-01",  # local file to upload
    path_in_repo="top2vec_model_arxiv_cs_from2010to2024-01-01",  # file name inside the repository
    repo_id="CCRss/topic_modeling_temp_name",  # target repository
    repo_type="model",
)

In [None]:
from huggingface_hub import HfApi
import os

api = HfApi()
folder_path = "/content/my_model_dir"
repo_id = "CCRss/top2vec_science_abstracts"
folder_in_repo = "BERTopic_model/"


for filename in os.listdir(folder_path):
    file_path = os.path.join(folder_path, filename)
    if os.path.isfile(file_path):
        api.upload_file(
            path_or_fileobj=file_path,
            path_in_repo=folder_in_repo + filename,
            repo_id=repo_id,
            repo_type="model"
        )
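
The loop above uploads the files in a folder one by one and skips subdirectories. The mapping from local files to repository paths can be checked offline before anything is sent; `plan_uploads` below is a hypothetical helper written for this sketch, and the temporary folder stands in for the real model directory:

```python
import tempfile
from pathlib import Path, PurePosixPath

def plan_uploads(folder_path, folder_in_repo):
    """Return (local_path, path_in_repo) pairs for the regular files in a folder.

    Hypothetical helper: it mirrors the loop above but only builds the plan.
    """
    folder = Path(folder_path)
    return [
        (str(path), str(PurePosixPath(folder_in_repo) / path.name))
        for path in sorted(folder.iterdir())
        if path.is_file()  # subdirectories are skipped, as in the loop above
    ]

# Stand-in folder with two files and one subdirectory
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "model.bin").write_bytes(b"\x00")
    (Path(tmp) / "config.json").write_text("{}")
    (Path(tmp) / "subdir").mkdir()
    plan = plan_uploads(tmp, "BERTopic_model/")

for local_path, repo_path in plan:
    print(f"{local_path} -> {repo_path}")
```

`huggingface_hub` also provides `HfApi.upload_folder`, which uploads a whole directory in a single call and may be simpler when nested folders need to be included.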