The Future of LLM-Based Agents: Making the Boxes Bigger

Arcus Team
June 17, 2024

Introduction

AI agents are a promising approach for using Large Language Models (LLMs) to do real work. However, today’s agent frameworks lack the reliability to handle complex, compound tasks out of the box, which keeps them out of many real-world applications. To make agent frameworks valuable for real work, we need to increase the scope of what they can do. By making them better at handling higher-order, compound tasks, we make the “boxes bigger” – which we believe is the future of AI and agents, and a major focus for us at Arcus.

While the potential of multi-agent frameworks that can plan, reason, remember, and reflect to solve real-world tasks is evident, as described in Part 1 (AI Agents: Models that Do Real Work) of this series, there’s still much room for improvement. In this post, we cover two of the main areas of innovation that we work on at Arcus and believe are required to take agents out of the playground and into the real world: long-term planning and system-level robustness.

Long-Term Planning

To increase the scope of problems that agents can solve, models need the ability to develop longer-term plans to execute against; this is necessary to keep agents on track toward their goal. The most common agent frameworks used today take a greedy approach to planning: at any point in time, the agent determines only the immediate next step to take.

For complex tasks, however, this greedy approach can be error-prone and unreliable. For example, imagine a person who takes a greedy approach to traveling within NYC. Suppose they want to visit Times Square. While trying to get there, they may encounter unexpected road closures due to a parade, leading to delay and confusion. Since they only planned one step at a time, they have no contingency plan and end up lost or delayed. In contrast, an upfront, longer-term plan could involve identifying potential obstacles, finding alternative routes, and developing contingency plans to handle unforeseen events. This way, they can navigate more efficiently and adapt to changes without losing sight of the end goal. Similarly, AI agents need the capability to create and adapt long-term plans to remain effective in dynamic environments.

One of the most crucial components of long-term planning is the ability to decompose a larger goal into tractable sub-goals. Normally, agents are presented with a final goal or objective. This might be something like “Book an itinerary for a business trip to NYC for March 13 - 19 that matches my company’s expense policy.” However, this overarching goal can be decomposed into individual sub-goals that can be executed independently. For example, “Understand allowable expenses for domestic flights and hotels based on my company’s expense policy” is a sub-goal of the objective above. This kind of decomposition into sub-goals allows for a bigger-picture, higher-level plan of the steps that need to be taken.
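
As a minimal sketch of what this decomposition step might look like (not a description of any particular framework’s implementation), a planner can prompt an LLM for a structured list of sub-goals. Here `call_llm` and `decompose_goal` are hypothetical helpers, with `call_llm` standing in for any chat-completion API:

```python
import json

def call_llm(prompt: str) -> str:
    # Hypothetical stand-in: replace with a real chat-completion call.
    # Returns a canned response here so the sketch runs end-to-end.
    return json.dumps([
        "Understand allowable expenses for domestic flights and hotels "
        "based on the company's expense policy",
        "Find policy-compliant flights arriving March 13 and departing March 19",
        "Find a policy-compliant hotel for March 13 - 19",
        "Book the selected flights and hotel",
    ])

def decompose_goal(goal: str) -> list[str]:
    """Ask the model for an ordered list of independently executable sub-goals."""
    prompt = (
        "Decompose the following goal into an ordered JSON array of "
        f"sub-goals, each executable by a single agent.\nGoal: {goal}"
    )
    return json.loads(call_llm(prompt))

plan = decompose_goal(
    "Book an itinerary for a business trip to NYC for March 13 - 19 "
    "that matches my company's expense policy."
)
```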

Decomposing longer-term plans into sub-goals allows us to distribute different sub-goals to more specialized “executor agents” which are better at solving certain types of tasks. Each sub-goal may require multiple actions and reasoning steps, and the capabilities and tools required to solve each sub-goal often differ from one to the next. For example, an agent equipped with web browsing capabilities and knowledge of world airports is going to be better at executing a sub-goal like “book a flight” than a different agent that’s only equipped with calculation tools.
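
One simple way to express this kind of routing, again as a hedged sketch with hypothetical names: `ExecutorAgent` objects declare their capabilities, and the planner assigns each sub-goal to the agent whose toolset best covers the sub-goal’s requirements:

```python
from dataclasses import dataclass, field

@dataclass
class ExecutorAgent:
    # Hypothetical executor: a name and the capabilities (tools) it offers.
    name: str
    capabilities: set[str] = field(default_factory=set)

    def run(self, sub_goal: str) -> str:
        return f"[{self.name}] completed: {sub_goal}"

AGENTS = [
    ExecutorAgent("travel_agent", {"web_browsing", "airport_knowledge"}),
    ExecutorAgent("calculator_agent", {"calculation"}),
]

def route(sub_goal: str, required: set[str]) -> ExecutorAgent:
    """Pick the executor whose capabilities overlap most with the requirements."""
    return max(AGENTS, key=lambda a: len(a.capabilities & required))

agent = route("Book a flight", {"web_browsing", "airport_knowledge"})
print(agent.run("Book a flight"))  # routed to travel_agent
```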

While the benefits of long-term planning are clear, there are two main challenges to having an agent framework that handles this well:

  1. Decomposing large goals into sub-goals whose scope and required capabilities match those of the available executor agents.
  2. Adapting plans mid-episode. Even with a sound original plan for a given goal, individual sub-goals can get derailed or the plan may not work as expected, so the framework needs to be able to revise its long-term plan in order to achieve the desired goal (a minimal replanning loop is sketched after this list).
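
Putting the hypothetical helpers from the earlier sketches together (`decompose_goal` as the planner, plus a stubbed `try_execute` for running one sub-goal), a replanning loop might look like the following; this is a sketch under those assumptions, not a definitive implementation:

```python
def try_execute(sub_goal: str) -> tuple[bool, str]:
    # Hypothetical: route the sub-goal to an executor and report success.
    return True, f"done: {sub_goal}"

def execute_with_replanning(goal: str, max_revisions: int = 3) -> list[str]:
    """Run sub-goals in order; on a failure, ask the planner to revise the plan."""
    plan = decompose_goal(goal)  # planner from the earlier sketch
    results: list[str] = []
    revisions = 0
    while plan:
        ok, output = try_execute(plan[0])
        if ok:
            results.append(output)
            plan.pop(0)
        elif revisions < max_revisions:
            revisions += 1
            # Re-plan the remainder of the episode in light of the failure.
            plan = decompose_goal(
                f"{goal}\nCompleted so far: {results}\nFailed step: {plan[0]}"
            )
        else:
            raise RuntimeError("Goal abandoned after too many plan revisions")
    return results
```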

Putting long-term plans together with the ability to adapt and reflect at each sub-goal enables agents to solve higher-order tasks that require better upfront planning. Adaptation at each sub-goal lets them respond to stimuli in a dynamic environment at every step. This long-term planning approach makes the boxes bigger by increasing the scope and complexity of tasks that an agentic system is able to decompose into executable plans.

Bringing Robustness to Agent Frameworks

LLMs are the building blocks of today’s AI agents – each agent executes a series of LLM calls in order to reason and determine what actions to take next. While LLMs are a powerful technology, their reasoning can still often be unreliable, as they can be prone to hallucinations and errors. An LLM’s ability to accurately determine which tools to use and with what parameters is especially critical and is still not at a production standard for complex tasks today.

As of the time this article was written, the top-performing model on the Berkeley Function Calling Leaderboard has an average tool-calling accuracy of 87% across a series of tool-calling tasks. Imagine a scenario where five tool calls need to be made correctly in succession for a goal to be completed successfully. Assuming each tool call is independent, the expected probability of successfully completing the goal is only (0.87)^5 = 49.8%. As the complexity of a goal increases, the number of tool calls required to achieve it grows, dropping the probability of success exponentially and preventing real-world, production-worthy performance.
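
The compounding argument is easy to reproduce; a few lines of Python show how quickly end-to-end reliability decays with chain length (assuming independent calls at 87% accuracy, as above):

```python
def chain_success(p: float, n: int) -> float:
    """Probability that n independent tool calls all succeed."""
    return p ** n

for n in (1, 5, 10, 20):
    print(f"{n:>2} calls: {chain_success(0.87, n):.1%}")
# 1 calls: 87.0%, 5 calls: 49.8%, 10 calls: 24.8%, 20 calls: 6.2%
```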

In order to make agents that can reliably solve more complex tasks, it’s important to build robustness through systems around the tool-calling capabilities of LLMs. While there’s a lot of hidden complexity in building robustness into agent workflows, there are two main design patterns that when orchestrated correctly can scale up the complexity of tasks that agents can solve reliably:

  1. Ensembles: Ensembling is a classic technique that follows the common wisdom that there’s strength in numbers – generating predictions from multiple models and using a majority vote to determine the final prediction. It’s recently been found that “More Agents Are All You Need” – a simple majority voting scheme across many agents’ predictions can greatly improve performance on many tasks. For best performance, we find it’s best to ensemble across a diverse set of models to get more “diverse perspectives” on a problem (a minimal voting sketch follows this list).
  2. Retries: When an LLM call doesn’t go as expected, intelligently retrying the call is a good way to create robustness (a retry sketch also follows this list). There are a few considerations unique to retrying LLM calls that require further innovation to work robustly:
    1. Evaluation. Even knowing when an LLM call is incorrect can be tricky – for example, detecting an incorrect use of a tool. Within multi-agent frameworks, it’s useful to use agents that “double check” the work of another agent to ensure it executed correctly.
    2. Idempotency. Some actions that LLMs take are inherently not retryable without consequences, because not all LLM actions are idempotent. For example, consider an agent that books a flight for you: if the agent makes a mistake and books the wrong flight, retrying is costly and doesn’t undo the previous mistake. In the case of non-idempotent actions, retries don’t work and can’t prevent an agent’s mistakes.
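
A minimal sketch of the ensembling pattern, assuming a set of hypothetical agents that each expose a prediction callable; a majority vote picks the final answer:

```python
from collections import Counter
from typing import Callable

def ensemble_vote(agents: list[Callable[[str], str]], task: str) -> str:
    """Run every agent on the task and return the most common answer."""
    answers = [agent(task) for agent in agents]
    return Counter(answers).most_common(1)[0][0]

# Hypothetical agents; in practice these would wrap diverse models and prompts.
agents = [lambda t: "42", lambda t: "42", lambda t: "41"]
print(ensemble_vote(agents, "What is 6 * 7?"))  # -> "42"
```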
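
And a sketch of the retry pattern under the same assumptions: a hypothetical `verify` checker (e.g. a double-checking agent) evaluates each result, and the loop only retries calls flagged as idempotent, since retrying a non-idempotent action can’t undo its side effects:

```python
from typing import Callable, TypeVar

T = TypeVar("T")

def call_with_retries(
    call: Callable[[], T],
    verify: Callable[[T], bool],
    idempotent: bool,
    max_attempts: int = 3,
) -> T:
    """Execute a tool call; if the verifier rejects the result, retry or escalate."""
    result = call()
    attempts = 1
    while not verify(result):
        if not idempotent:
            # Retrying (e.g. a flight booking) won't undo the mistake; escalate.
            raise RuntimeError("Non-idempotent call failed verification")
        if attempts >= max_attempts:
            raise RuntimeError("Exhausted retries")
        result = call()
        attempts += 1
    return result
```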

These kinds of system approaches – which orchestrate LLM agents together to provide higher-order guarantees, determinism, and checks and balances – can provide better reliability when performing compound tasks. As the complexity of a task increases, so does the need for increased reliability at each step. Compound AI systems bake in steps that provide guarantees over the performance of the overall system against the longer-term plan, rather than over individual steps or sub-goals alone.

Conclusion

LLM-based agents have massive potential to change the way AI is used to do real work. At Arcus, we believe that in order to make agents solve increasingly complex tasks, innovation has to come on two fronts:

  1. Systems for long-term planning and reliable sub-goal execution.
  2. Systems of robustness around LLMs and Agents to provide reliability and determinism.

At Arcus, we work on hard problems like these to make AI do real work and accelerate our mission of bringing intelligence to complex data and processes. We work on emergent capabilities in data understanding, reasoning, and action-taking. If this sounds exciting to you, we’d love to meet you! We’re hiring across many roles – check out our careers page or reach out to us at recruiting@arcus.co.
