Spaces: awacke1/Multi-Agent-Systems-MAS-Autonomous-Agents-Interacting

awacke1 committed on Apr 11, 2024

Commit d1b1e68 (verified) · 1 Parent(s): 4876cb6

Update app.py

Files changed (1)

app.py +2 -477

@@ -39,7 +39,7 @@ st.markdown("---")
 st.markdown("## **Interaction Protocol** 🤝 :bulb:**")
 st.markdown("### **Key Elements** :guards:")
-st.markdown(f"""
         1. **Communication** 🗣 \n
             - Agents exchange information \n
         2. **Cooperation** 🤝 \n
@@ -116,7 +116,7 @@ https://aka.ms/kosmos-2.
 ---------------
 ### 19 Feb 2024 | [ScreenAI: A Vision-Language Model for UI and Infographics Understanding](https://arxiv.org/abs/2402.04615) | [⬇️](https://arxiv.org/pdf/2402.04615)
-*Gilles Baechler, Srinivas Sunkara, Maria Wang, Fedir Zubach, Hassan  Mansoor, Vincent Etter, Victor C\u{a}rbune, Jason Lin, Jindong Chen, Abhanshu  Sharma*
   Screen user interfaces (UIs) and infographics, sharing similar visual
 language and design principles, play important roles in human communication and
@@ -266,479 +266,4 @@ Python interpreter and uniquely tailored to perform sophisticated tasks (e.g.,
 model training) using existing libraries and autonomously self-debug.
 ---------------
-### 24 Jan 2024 | [VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web  Tasks](https://arxiv.org/abs/2401.13649) | [⬇️](https://arxiv.org/pdf/2401.13649)
-*Jing Yu Koh, Robert Lo, Lawrence Jang, Vikram Duvvur, Ming Chong Lim,  Po-Yu Huang, Graham Neubig, Shuyan Zhou, Ruslan Salakhutdinov, Daniel Fried*
-  Autonomous agents capable of planning, reasoning, and executing actions on
-the web offer a promising avenue for automating computer tasks. However, the
-majority of existing benchmarks primarily focus on text-based agents,
-neglecting many natural tasks that require visual information to effectively
-solve. Given that most computer interfaces cater to human perception, visual
-information often augments textual data in ways that text-only models struggle
-to harness effectively. To bridge this gap, we introduce VisualWebArena, a
-benchmark designed to assess the performance of multimodal web agents on
-realistic \textit{visually grounded tasks}. VisualWebArena comprises of a set
-of diverse and complex web-based tasks that evaluate various capabilities of
-autonomous multimodal agents. To perform on this benchmark, agents need to
-accurately process image-text inputs, interpret natural language instructions,
-and execute actions on websites to accomplish user-defined objectives. We
-conduct an extensive evaluation of state-of-the-art LLM-based autonomous
-agents, including several multimodal models. Through extensive quantitative and
-qualitative analysis, we identify several limitations of text-only LLM agents,
-and reveal gaps in the capabilities of state-of-the-art multimodal language
-agents. VisualWebArena provides a framework for evaluating multimodal
-autonomous language agents, and offers insights towards building stronger
-autonomous agents for the web. Our code, baseline models, and data is publicly
-available at https://jykoh.com/vwa.
----------------
-### 22 Feb 2018 | [Multimodal Named Entity Recognition for Short Social Media Posts](https://arxiv.org/abs/1802.07862) | [⬇️](https://arxiv.org/pdf/1802.07862)
-*Seungwhan Moon, Leonardo Neves, Vitor Carvalho*
-  We introduce a new task called Multimodal Named Entity Recognition (MNER) for
-noisy user-generated data such as tweets or Snapchat captions, which comprise
-short text with accompanying images. These social media posts often come in
-inconsistent or incomplete syntax and lexical notations with very limited
-surrounding textual contexts, bringing significant challenges for NER. To this
-end, we create a new dataset for MNER called SnapCaptions (Snapchat
-image-caption pairs submitted to public and crowd-sourced stories with fully
-annotated named entities). We then build upon the state-of-the-art Bi-LSTM
-word/character based NER models with 1) a deep image network which incorporates
-relevant visual context to augment textual information, and 2) a generic
-modality-attention module which learns to attenuate irrelevant modalities while
-amplifying the most informative ones to extract contexts from, adaptive to each
-sample and token. The proposed MNER model with modality attention significantly
-outperforms the state-of-the-art text-only NER models by successfully
-leveraging provided visual contexts, opening up potential applications of MNER
-on myriads of social media platforms.
----------------
-### 21 Sep 2023 | [You Only Look at Screens: Multimodal Chain-of-Action Agents](https://arxiv.org/abs/2309.11436) | [⬇️](https://arxiv.org/pdf/2309.11436)
-*Zhuosheng Zhang, Aston Zhang*
-  Autonomous user interface (UI) agents aim to facilitate task automation by
-interacting with the user interface without manual intervention. Recent studies
-have investigated eliciting the capabilities of large language models (LLMs)
-for effective engagement in diverse environments. To align with the
-input-output requirement of LLMs, existing approaches are developed under a
-sandbox setting where they rely on external tools and application-specific APIs
-to parse the environment into textual elements and interpret the predicted
-actions. Consequently, those approaches often grapple with inference
-inefficiency and error propagation risks. To mitigate the challenges, we
-introduce Auto-UI, a multimodal solution that directly interacts with the
-interface, bypassing the need for environment parsing or reliance on
-application-dependent APIs. Moreover, we propose a chain-of-action technique --
-leveraging a series of intermediate previous action histories and future action
-plans -- to help the agent decide what action to execute. We evaluate our
-approach on a new device-control benchmark AITW with 30K unique instructions,
-spanning multi-step tasks such as application operation, web searching, and web
-shopping. Experimental results show that Auto-UI achieves state-of-the-art
-performance with an action type prediction accuracy of 90% and an overall
-action success rate of 74%. Code is publicly available at
-https://github.com/cooelf/Auto-UI.
----------------
-### 06 Jun 2023 | [LIDA: A Tool for Automatic Generation of Grammar-Agnostic Visualizations  and Infographics using Large Language Models](https://arxiv.org/abs/2303.02927) | [⬇️](https://arxiv.org/pdf/2303.02927)
-*Victor Dibia*
-  Systems that support users in the automatic creation of visualizations must
-address several subtasks - understand the semantics of data, enumerate relevant
-visualization goals and generate visualization specifications. In this work, we
-pose visualization generation as a multi-stage generation problem and argue
-that well-orchestrated pipelines based on large language models (LLMs) such as
-ChatGPT/GPT-4 and image generation models (IGMs) are suitable to addressing
-these tasks. We present LIDA, a novel tool for generating grammar-agnostic
-visualizations and infographics. LIDA comprises of 4 modules - A SUMMARIZER
-that converts data into a rich but compact natural language summary, a GOAL
-EXPLORER that enumerates visualization goals given the data, a VISGENERATOR
-that generates, refines, executes and filters visualization code and an
-INFOGRAPHER module that yields data-faithful stylized graphics using IGMs. LIDA
-provides a python api, and a hybrid user interface (direct manipulation and
-multilingual natural language) for interactive chart, infographics and data
-story generation. Learn more about the project here -
-https://microsoft.github.io/lida/
----------------
-### 16 Feb 2023 | [VLTinT: Visual-Linguistic Transformer-in-Transformer for Coherent Video  Paragraph Captioning](https://arxiv.org/abs/2211.15103) | [⬇️](https://arxiv.org/pdf/2211.15103)
-*Kashu Yamazaki, Khoa Vo, Sang Truong, Bhiksha Raj, Ngan Le*
-  Video paragraph captioning aims to generate a multi-sentence description of
-an untrimmed video with several temporal event locations in coherent
-storytelling. Following the human perception process, where the scene is
-effectively understood by decomposing it into visual (e.g. human, animal) and
-non-visual components (e.g. action, relations) under the mutual influence of
-vision and language, we first propose a visual-linguistic (VL) feature. In the
-proposed VL feature, the scene is modeled by three modalities including (i) a
-global visual environment; (ii) local visual main agents; (iii) linguistic
-scene elements. We then introduce an autoregressive Transformer-in-Transformer
-(TinT) to simultaneously capture the semantic coherence of intra- and
-inter-event contents within a video. Finally, we present a new VL contrastive
-loss function to guarantee learnt embedding features are matched with the
-captions semantics. Comprehensive experiments and extensive ablation studies on
-ActivityNet Captions and YouCookII datasets show that the proposed
-Visual-Linguistic Transformer-in-Transform (VLTinT) outperforms prior
-state-of-the-art methods on accuracy and diversity. Source code is made
-publicly available at: https://github.com/UARK-AICV/VLTinT.
----------------
-### 04 Mar 2021 | [FAtiMA Toolkit -- Toward an effective and accessible tool for the  development of intelligent virtual agents and social robots](https://arxiv.org/abs/2103.03020) | [⬇️](https://arxiv.org/pdf/2103.03020)
-*Samuel Mascarenhas, Manuel Guimar\~aes, Pedro A. Santos, Jo\~ao Dias,  Rui Prada, Ana Paiva*
-  More than a decade has passed since the development of FearNot!, an
-application designed to help children deal with bullying through role-playing
-with virtual characters. It was also the application that led to the creation
-of FAtiMA, an affective agent architecture for creating autonomous characters
-that can evoke empathic responses. In this paper, we describe FAtiMA Toolkit, a
-collection of open-source tools that is designed to help researchers, game
-developers and roboticists incorporate a computational model of emotion and
-decision-making in their work. The toolkit was developed with the goal of
-making FAtiMA more accessible, easier to incorporate into different projects
-and more flexible in its capabilities for human-agent interaction, based upon
-the experience gathered over the years across different virtual environments
-and human-robot interaction scenarios. As a result, this work makes several
-different contributions to the field of Agent-Based Architectures. More
-precisely, FAtiMA Toolkit's library based design allows developers to easily
-integrate it with other frameworks, its meta-cognitive model affords different
-internal reasoners and affective components and its explicit dialogue structure
-gives control to the author even within highly complex scenarios. To
-demonstrate the use of FAtiMA Toolkit, several different use cases where the
-toolkit was successfully applied are described and discussed.
----------------
-### 12 Sep 2022 | [emojiSpace: Spatial Representation of Emojis](https://arxiv.org/abs/2209.09871) | [⬇️](https://arxiv.org/pdf/2209.09871)
-*Moeen Mostafavi, Mahsa Pahlavikhah Varnosfaderani, Fateme Nikseresht,  Seyed Ahmad Mansouri*
-  In the absence of nonverbal cues during messaging communication, users
-express part of their emotions using emojis. Thus, having emojis in the
-vocabulary of text messaging language models can significantly improve many
-natural language processing (NLP) applications such as online communication
-analysis. On the other hand, word embedding models are usually trained on a
-very large corpus of text such as Wikipedia or Google News datasets that
-include very few samples with emojis. In this study, we create emojiSpace,
-which is a combined word-emoji embedding using the word2vec model from the
-Genism library in Python. We trained emojiSpace on a corpus of more than 4
-billion tweets and evaluated it by implementing sentiment analysis on a Twitter
-dataset containing more than 67 million tweets as an extrinsic task. For this
-task, we compared the performance of two different classifiers of random forest
-(RF) and linear support vector machine (SVM). For evaluation, we compared
-emojiSpace performance with two other pre-trained embeddings and demonstrated
-that emojiSpace outperforms both.
----------------
-### 27 Jan 2020 | [CodeReef: an open platform for portable MLOps, reusable automation  actions and reproducible benchmarking](https://arxiv.org/abs/2001.07935) | [⬇️](https://arxiv.org/pdf/2001.07935)
-*Grigori Fursin, Herve Guillou and Nicolas Essayan*
-  We present CodeReef - an open platform to share all the components necessary
-to enable cross-platform MLOps (MLSysOps), i.e. automating the deployment of ML
-models across diverse systems in the most efficient way. We also introduce the
-CodeReef solution - a way to package and share models as non-virtualized,
-portable, customizable and reproducible archive files. Such ML packages include
-JSON meta description of models with all dependencies, Python APIs, CLI actions
-and portable workflows necessary to automatically build, benchmark, test and
-customize models across diverse platforms, AI frameworks, libraries, compilers
-and datasets. We demonstrate several CodeReef solutions to automatically build,
-run and measure object detection based on SSD-Mobilenets, TensorFlow and COCO
-dataset from the latest MLPerf inference benchmark across a wide range of
-platforms from Raspberry Pi, Android phones and IoT devices to data centers.
-Our long-term goal is to help researchers share their new techniques as
-production-ready packages along with research papers to participate in
-collaborative and reproducible benchmarking, compare the different
-ML/software/hardware stacks and select the most efficient ones on a Pareto
-frontier using online CodeReef dashboards.
----------------
-### 28 Feb 2024 | [OmniACT: A Dataset and Benchmark for Enabling Multimodal Generalist  Autonomous Agents for Desktop and Web](https://arxiv.org/abs/2402.17553) | [⬇️](https://arxiv.org/pdf/2402.17553)
-*Raghav Kapoor, Yash Parag Butala, Melisa Russak, Jing Yu Koh, Kiran  Kamble, Waseem Alshikh, Ruslan Salakhutdinov*
-  For decades, human-computer interaction has fundamentally been manual. Even
-today, almost all productive work done on the computer necessitates human input
-at every step. Autonomous virtual agents represent an exciting step in
-automating many of these menial tasks. Virtual agents would empower users with
-limited technical proficiency to harness the full possibilities of computer
-systems. They could also enable the efficient streamlining of numerous computer
-tasks, ranging from calendar management to complex travel bookings, with
-minimal human intervention. In this paper, we introduce OmniACT, the
-first-of-a-kind dataset and benchmark for assessing an agent's capability to
-generate executable programs to accomplish computer tasks. Our scope extends
-beyond traditional web automation, covering a diverse range of desktop
-applications. The dataset consists of fundamental tasks such as "Play the next
-song", as well as longer horizon tasks such as "Send an email to John Doe
-mentioning the time and place to meet". Specifically, given a pair of screen
-image and a visually-grounded natural language task, the goal is to generate a
-script capable of fully executing the task. We run several strong baseline
-language model agents on our benchmark. The strongest baseline, GPT-4, performs
-the best on our benchmark However, its performance level still reaches only 15%
-of the human proficiency in generating executable scripts capable of completing
-the task, demonstrating the challenge of our task for conventional web agents.
-Our benchmark provides a platform to measure and evaluate the progress of
-language model agents in automating computer tasks and motivates future work
-towards building multimodal models that bridge large language models and the
-visual grounding of computer screens.
----------------
-### 24 Mar 2021 | [Proactive Interaction Framework for Intelligent Social Receptionist  Robots](https://arxiv.org/abs/2012.04832) | [⬇️](https://arxiv.org/pdf/2012.04832)
-*Yang Xue, Fan Wang, Hao Tian, Min Zhao, Jiangyong Li, Haiqing Pan and  Yueqiang Dong*
-  Proactive human-robot interaction (HRI) allows the receptionist robots to
-actively greet people and offer services based on vision, which has been found
-to improve acceptability and customer satisfaction. Existing approaches are
-either based on multi-stage decision processes or based on end-to-end decision
-models. However, the rule-based approaches require sedulous expert efforts and
-only handle minimal pre-defined scenarios. On the other hand, existing works
-with end-to-end models are limited to very general greetings or few behavior
-patterns (typically less than 10). To address those challenges, we propose a
-new end-to-end framework, the TransFormer with Visual Tokens for Human-Robot
-Interaction (TFVT-HRI). The proposed framework extracts visual tokens of
-relative objects from an RGB camera first. To ensure the correct interpretation
-of the scenario, a transformer decision model is then employed to process the
-visual tokens, which is augmented with the temporal and spatial information. It
-predicts the appropriate action to take in each scenario and identifies the
-right target. Our data is collected from an in-service receptionist robot in an
-office building, which is then annotated by experts for appropriate proactive
-behavior. The action set includes 1000+ diverse patterns by combining language,
-emoji expression, and body motions. We compare our model with other SOTA
-end-to-end models on both offline test sets and online user experiments in
-realistic office building environments to validate this framework. It is
-demonstrated that the decision model achieves SOTA performance in action
-triggering and selection, resulting in more humanness and intelligence when
-compared with the previous reactive reception policies.
----------------
-### 15 Mar 2023 | [Sustainable Cloud Services for Verbal Interaction with Embodied Agents](https://arxiv.org/abs/2203.02606) | [⬇️](https://arxiv.org/pdf/2203.02606)
-*Lucrezia Grassi, Carmine Tommaso Recchiuto, Antonio Sgorbissa*
-  This article presents the design and the implementation of a cloud system for
-knowledge-based autonomous interaction devised for Social Robots and other
-conversational agents. The system is particularly convenient for low-cost
-robots and devices: it can be used as a stand-alone dialogue system or as an
-integration to provide "background" dialogue capabilities to any preexisting
-Natural Language Processing ability that the robot may already have as part of
-its basic skills. By connecting to the cloud, developers are provided with a
-sustainable solution to manage verbal interaction through a network connection,
-with about 3,000 topics of conversation ready for "chit-chatting" and a library
-of pre-cooked plans that only needs to be grounded into the robot's physical
-capabilities. The system is structured as a set of REST API endpoints so that
-it can be easily expanded by adding new APIs to improve the capabilities of the
-clients connected to the cloud. Another key feature of the system is that it
-has been designed to make the development of its clients straightforward: in
-this way, multiple robots and devices can be easily endowed with the capability
-of autonomously interacting with the user, understanding when to perform
-specific actions, and exploiting all the information provided by cloud
-services. The article outlines and discusses the results of the experiments
-performed to assess the system's performance in terms of response time, paving
-the way for its use both for research and market solutions. Links to
-repositories with clients for ROS and popular robots such as Pepper and NAO are
-available on request.
----------------<s>[INST] Context:
- 1. <b> AgentAvatar: Disentangling Planning, Driving and Rendering for  Photorealistic Avatar Agents </b>
- Abstract:   In this study, our goal is to create interactive avatar agents that can
-autonomously plan and animate nuanced facial movements realistically, from both
-visual and behavioral perspectives. Given high-level inputs about the
-environment and agent profile, our framework harnesses LLMs to produce a series
-of detailed text descriptions of the avatar agents' facial motions. These
-descriptions are then processed by our task-agnostic driving engine into motion
-token sequences, which are subsequently converted into continuous motion
-embeddings that are further consumed by our standalone neural-based renderer to
-generate the final photorealistic avatar animations. These streamlined
-processes allow our framework to adapt to a variety of non-verbal avatar
-interactions, both monadic and dyadic. Our extensive study, which includes
-experiments on both newly compiled and existing datasets featuring two types of
-agents -- one capable of monadic interaction with the environment, and the
-other designed for dyadic conversation -- validates the effectiveness and
-versatility of our approach. To our knowledge, we advanced a leap step by
-combining LLMs and neural rendering for generalized non-verbal prediction and
-photo-realistic rendering of avatar agents.
-2. <b> Caption Anything: Interactive Image Description with Diverse Multimodal  Controls </b>
- Abstract:   Controllable image captioning is an emerging multimodal topic that aims to
-describe the image with natural language following human purpose,
-$\textit{e.g.}$, looking at the specified regions or telling in a particular
-text style. State-of-the-art methods are trained on annotated pairs of input
-controls and output captions. However, the scarcity of such well-annotated
-multimodal data largely limits their usability and scalability for interactive
-AI systems. Leveraging unimodal instruction-following foundation models is a
-promising alternative that benefits from broader sources of data. In this
-paper, we present Caption AnyThing (CAT), a foundation model augmented image
-captioning framework supporting a wide range of multimodel controls: 1) visual
-controls, including points, boxes, and trajectories; 2) language controls, such
-as sentiment, length, language, and factuality. Powered by Segment Anything
-Model (SAM) and ChatGPT, we unify the visual and language prompts into a
-modularized framework, enabling the flexible combination between different
-controls. Extensive case studies demonstrate the user intention alignment
-capabilities of our framework, shedding light on effective user interaction
-modeling in vision-language applications. Our code is publicly available at
-https://github.com/ttengwang/Caption-Anything.
-3. <b> Kosmos-2: Grounding Multimodal Large Language Models to the World </b>
- Abstract:   We introduce Kosmos-2, a Multimodal Large Language Model (MLLM), enabling new
-capabilities of perceiving object descriptions (e.g., bounding boxes) and
-grounding text to the visual world. Specifically, we represent refer
-expressions as links in Markdown, i.e., ``[text span](bounding boxes)'', where
-object descriptions are sequences of location tokens. Together with multimodal
-corpora, we construct large-scale data of grounded image-text pairs (called
-GrIT) to train the model. In addition to the existing capabilities of MLLMs
-(e.g., perceiving general modalities, following instructions, and performing
-in-context learning), Kosmos-2 integrates the grounding capability into
-downstream applications. We evaluate Kosmos-2 on a wide range of tasks,
-including (i) multimodal grounding, such as referring expression comprehension,
-and phrase grounding, (ii) multimodal referring, such as referring expression
-generation, (iii) perception-language tasks, and (iv) language understanding
-and generation. This work lays out the foundation for the development of
-Embodiment AI and sheds light on the big convergence of language, multimodal
-perception, action, and world modeling, which is a key step toward artificial
-general intelligence. Code and pretrained models are available at
-https://aka.ms/kosmos-2.
-4. <b> ScreenAI: A Vision-Language Model for UI and Infographics Understanding </b>
- Abstract:   Screen user interfaces (UIs) and infographics, sharing similar visual
-language and design principles, play important roles in human communication and
-human-machine interaction. We introduce ScreenAI, a vision-language model that
-specializes in UI and infographics understanding. Our model improves upon the
-PaLI architecture with the flexible patching strategy of pix2struct and is
-trained on a unique mixture of datasets. At the heart of this mixture is a
-novel screen annotation task in which the model has to identify the type and
-location of UI elements. We use these text annotations to describe screens to
-Large Language Models and automatically generate question-answering (QA), UI
-navigation, and summarization training datasets at scale. We run ablation
-studies to demonstrate the impact of these design choices. At only 5B
-parameters, ScreenAI achieves new state-of-the-artresults on UI- and
-infographics-based tasks (Multi-page DocVQA, WebSRC, MoTIF and Widget
-Captioning), and new best-in-class performance on others (Chart QA, DocVQA, and
-InfographicVQA) compared to models of similar size. Finally, we release three
-new datasets: one focused on the screen annotation task and two others focused
-on question answering.
-5. <b> ThingTalk: An Extensible, Executable Representation Language for  Task-Oriented Dialogues </b>
- Abstract:   Task-oriented conversational agents rely on semantic parsers to translate
-natural language to formal representations. In this paper, we propose the
-design and rationale of the ThingTalk formal representation, and how the design
-improves the development of transactional task-oriented agents.
-  ThingTalk is built on four core principles: (1) representing user requests
-directly as executable statements, covering all the functionality of the agent,
-(2) representing dialogues formally and succinctly to support accurate
-contextual semantic parsing, (3) standardizing types and interfaces to maximize
-reuse between agents, and (4) allowing multiple, independently-developed agents
-to be composed in a single virtual assistant. ThingTalk is developed as part of
-the Genie Framework that allows developers to quickly build transactional
-agents given a database and APIs.
-  We compare ThingTalk to existing representations: SMCalFlow, SGD, TreeDST.
-Compared to the others, the ThingTalk design is both more general and more
-cost-effective. Evaluated on the MultiWOZ benchmark, using ThingTalk and
-associated tools yields a new state of the art accuracy of 79% turn-by-turn.
-6. <b> 3D-GPT: Procedural 3D Modeling with Large Language Models </b>
- Abstract:   In the pursuit of efficient automated content creation, procedural
-generation, leveraging modifiable parameters and rule-based systems, emerges as
-a promising approach. Nonetheless, it could be a demanding endeavor, given its
-intricate nature necessitating a deep understanding of rules, algorithms, and
-parameters. To reduce workload, we introduce 3D-GPT, a framework utilizing
-large language models~(LLMs) for instruction-driven 3D modeling. 3D-GPT
-positions LLMs as proficient problem solvers, dissecting the procedural 3D
-modeling tasks into accessible segments and appointing the apt agent for each
-task. 3D-GPT integrates three core agents: the task dispatch agent, the
-conceptualization agent, and the modeling agent. They collaboratively achieve
-two objectives. First, it enhances concise initial scene descriptions, evolving
-them into detailed forms while dynamically adapting the text based on
-subsequent instructions. Second, it integrates procedural generation,
-extracting parameter values from enriched text to effortlessly interface with
-3D software for asset creation. Our empirical investigations confirm that
-3D-GPT not only interprets and executes instructions, delivering reliable
-results but also collaborates effectively with human designers. Furthermore, it
-seamlessly integrates with Blender, unlocking expanded manipulation
-possibilities. Our work highlights the potential of LLMs in 3D modeling,
-offering a basic framework for future advancements in scene generation and
-animation.
-7. <b> Embodied Task Planning with Large Language Models </b>
- Abstract:   Equipping embodied agents with commonsense is important for robots to
-successfully complete complex human instructions in general environments.
-Recent large language models (LLM) can embed rich semantic knowledge for agents
-in plan generation of complex tasks, while they lack the information about the
-realistic world and usually yield infeasible action sequences. In this paper,
-we propose a TAsk Planing Agent (TaPA) in embodied tasks for grounded planning
-with physical scene constraint, where the agent generates executable plans
-according to the existed objects in the scene by aligning LLMs with the visual
-perception models. Specifically, we first construct a multimodal dataset
-containing triplets of indoor scenes, instructions and action plans, where we
-provide the designed prompts and the list of existing objects in the scene for
-GPT-3.5 to generate a large number of instructions and corresponding planned
-actions. The generated data is leveraged for grounded plan tuning of
-pre-trained LLMs. During inference, we discover the objects in the scene by
-extending open-vocabulary object detectors to multi-view RGB images collected
-in different achievable locations. Experimental results show that the generated
-plan from our TaPA framework can achieve higher success rate than LLaVA and
-GPT-3.5 by a sizable margin, which indicates the practicality of embodied task
-planning in general and complex environments.
-8. <b> Joint Representation Learning for Text and 3D Point Cloud </b>
- Abstract:   Recent advancements in vision-language pre-training (e.g. CLIP) have shown
-that vision models can benefit from language supervision. While many models
-using language modality have achieved great success on 2D vision tasks, the
-joint representation learning of 3D point cloud with text remains
-under-explored due to the difficulty of 3D-Text data pair acquisition and the
-irregularity of 3D data structure. In this paper, we propose a novel Text4Point
-framework to construct language-guided 3D point cloud models. The key idea is
-utilizing 2D images as a bridge to connect the point cloud and the language
-modalities. The proposed Text4Point follows the pre-training and fine-tuning
-paradigm. During the pre-training stage, we establish the correspondence of
-images and point clouds based on the readily available RGB-D data and use
-contrastive learning to align the image and point cloud representations.
-Together with the well-aligned image and text features achieved by CLIP, the
-point cloud features are implicitly aligned with the text embeddings. Further,
-we propose a Text Querying Module to integrate language information into 3D
-representation learning by querying text embeddings with point cloud features.
-For fine-tuning, the model learns task-specific 3D representations under
-informative language guidance from the label set without 2D images. Extensive
-experiments demonstrate that our model shows consistent improvement on various
-downstream tasks, such as point cloud semantic segmentation, instance
-segmentation, and object detection. The code will be available here:
-https://github.com/LeapLabTHU/Text4Point
-9. <b> Executable Code Actions Elicit Better LLM Agents </b>
- Abstract:   Large Language Model (LLM) agents, capable of performing a broad range of
-actions, such as invoking tools and controlling robots, show great potential in
-tackling real-world challenges. LLM agents are typically prompted to produce
-actions by generating JSON or text in a pre-defined format, which is usually
-limited by constrained action space (e.g., the scope of pre-defined tools) and
-restricted flexibility (e.g., inability to compose multiple tools). This work
-proposes to use executable Python code to consolidate LLM agents' actions into
-a unified action space (CodeAct). Integrated with a Python interpreter, CodeAct
-can execute code actions and dynamically revise prior actions or emit new
-actions upon new observations through multi-turn interactions. Our extensive
-analysis of 17 LLMs on API-Bank and a newly curated benchmark shows that
-CodeAct outperforms widely used alternatives (up to 20% higher success rate).
-The encouraging performance of CodeAct motivates us to build an open-source LLM
-agent that interacts with environments by executing interpretable code and
-collaborates with users using natural language. To this end, we collect an
-instruction-tuning dataset CodeActInstruct that consists of 7k multi-turn
-interactions using CodeAct. We show that it can be used with existing data to
-improve models in agent-oriented tasks without compromising their general
-capability. CodeActAgent, finetuned from Llama2 and Mistral, is integrated with
-Python interpreter and uniquely tailored to perform sophisticated tasks (e.g.,
-model training) using existing libraries and autonomously self-debug.
-10. <b> VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web  Tasks </b>
- Abstract:   Autonomous agents capable of planning, reasoning, and executing actions on
-the web offer a promising avenue for automating computer tasks. However, the
-majority of existing benchmarks primarily focus on text-based agents,
-neglecting many natural tasks that require visual information to effectively
-solve. Given that most computer interfaces cater to human perception, visual
-information often augments textual data in ways that text-only models struggle
-to harness effectively. To bridge this gap, we introduce VisualWebArena, a
-benchmark designed to assess the performance of multimodal web agents on
-realistic \textit{visually grounded tasks}. VisualWebArena comprises of a set
-of diverse and complex web-based tasks that evaluate various capabilities of
-autonomous multimodal agents. To perform on this benchmark, agents need to
-accurately process image-text inputs, interpret natural language instructions,
-and execute actions on websites to accomplish user-defined objectives. We
-conduct an extensive evaluation of state-of-the-art LLM-based autonomous
-agents, including several multimodal models. Through extensive quantitative and
-qualitative analysis, we identify several limitations of text-only LLM agents,
-and reveal gaps in the capabilities of state-of-the-art multimodal language
-agents. VisualWebArena provides a framework for evaluating multimodal
-autonomous language agents, and offers insights towards building stronger
-autonomous agents for the web.
 """)

 st.markdown("## **Interaction Protocol** 🤝 :bulb:**")
 st.markdown("### **Key Elements** :guards:")
+st.markdown("""
         1. **Communication** 🗣 \n
             - Agents exchange information \n
         2. **Cooperation** 🤝 \n
 ---------------
 ### 19 Feb 2024 | [ScreenAI: A Vision-Language Model for UI and Infographics Understanding](https://arxiv.org/abs/2402.04615) | [⬇️](https://arxiv.org/pdf/2402.04615)
+*Gilles Baechler, Srinivas Sunkara, Maria Wang, Fedir Zubach, Hassan  Mansoor, Vincent Etter, Victor Cărbune, Jason Lin, Jindong Chen, Abhanshu  Sharma*
   Screen user interfaces (UIs) and infographics, sharing similar visual
 language and design principles, play important roles in human communication and
 model training) using existing libraries and autonomously self-debug.
 ---------------
 """)