In mid-March, the American start-up Cognition introduced the AI assistant Devin. It is said to be able to implement any programming project automatically, based on a few natural-language instructions. In contrast to previous AI-based approaches to code generation such as Microsoft Copilot, Devin does not limit itself to generating or optimizing individual code blocks, but implements entire software projects. According to Cognition, Devin will radically change the way software developers work in the future. The examples shown as evidence in the announcement are impressive, but the feedback from practice is mixed.
Lars Röwekamp, founder of the IT consulting and development company open knowledge GmbH, deals with the in-depth analysis and evaluation of new software and technology trends as part of his work as “CIO New Technologies”, with a focus on the areas of machine learning and cloud computing. A particular focus of his work is the embedding of ML-based processes and applications into the existing IT landscape of companies. Together with his customers, he analyzes the possibilities – and limits – of expanding and optimizing business processes based on ML-based approaches and, with the support of his team, implements these until they are ready for production. Lars is the author of several specialist articles and books.
The score of 14 percent that Devin achieved on a subset of the coding problems in the SWE-bench benchmark also suggests a big leap in AI-based software generation. For comparison, the leading results so far: Claude 2 comes in at 4.8 percent, Llama at 3.9 percent and GPT-4 at 1.5 percent.
However, reactions to Devin's abilities in the community are mixed. Aravind Srinivas, CEO of Perplexity AI, writes that the autonomous AI coder appears to have “crossed the threshold” of human capabilities. Evan You, a developer based in Singapore, described the AI assistant as “pretty inadequate.” He added that a developer who completes tasks correctly only 14 percent of the time is more of a “burden” than an “enrichment.”
But how should Devin's abilities actually be assessed? Are software engineers threatened with extinction?
Companion instead of replacement
The good news first: Even Cognition sees Devin as more of a companion or colleague than a full-fledged replacement for a software developer. “Devin is a tireless, capable teammate, equally willing to work alongside you or independently complete tasks for you to review,” Cognition writes in a blog post. “With Devin, software engineers can focus on more interesting problems and development teams can aim for more ambitious goals.”
In order to realistically evaluate Devin's capabilities and limitations, it is important to understand how such a system works under the hood. Since no details about how Devin works have been published yet, the following explanations are based on the various demos on the Cognition blog.
A look behind the scenes
Based on a given task, for example a performance comparison of a specific LLM (Large Language Model) across different providers, Devin first creates a plan with the steps needed to implement the project. The more precisely the human client formulates the requirement, the more targeted the plan.
The system then implements the plan using different AI agents and tools. To do this, Devin starts a so-called agent loop, which runs until it has answered the request satisfactorily. Each step within the loop combines reasoning with the action that results from it. Both procedures have long been known and established in the LLM environment. The trick with systems like Devin is to combine the two in the form of ReAct agents and to use the knowledge collected in the previous steps as the basis for deciding on the next step and its associated action.
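Since Cognition has not published Devin's internals, the following is only a minimal sketch of a generic ReAct-style agent loop, not Devin's implementation. The tools `run_tests` and `edit_code`, the `Step` record and the rule-based `fake_llm` stand-in are all hypothetical names chosen for illustration; in a real system, `fake_llm` would be a call to a language model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Step:
    """One iteration of the loop: reasoning, chosen action, and its result."""
    thought: str
    action: str
    argument: str
    observation: str


# Hypothetical tools; a real agent would wrap a shell, an editor or a browser.
def run_tests(argument: str) -> str:
    return "2 tests failed" if argument == "before fix" else "all tests passed"


def edit_code(argument: str) -> str:
    return f"applied patch: {argument}"


TOOLS: Dict[str, Callable[[str], str]] = {
    "run_tests": run_tests,
    "edit_code": edit_code,
}


def fake_llm(task: str, history: List[Step]) -> Tuple[str, str, str]:
    """Stand-in for an LLM call (assumption): returns (thought, action, argument).

    The decision depends on the observations collected so far, which is the
    core idea of combining reasoning and acting.
    """
    if not history:
        return (f"To work on '{task}' I first check the current state.",
                "run_tests", "before fix")
    last = history[-1].observation
    if "failed" in last:
        return ("Tests fail, so I patch the code.", "edit_code", "fix off-by-one")
    if last.startswith("applied patch"):
        return ("A patch was applied, so I re-run the tests.",
                "run_tests", "after fix")
    return ("All tests pass, the task is complete.", "finish", "done")


def agent_loop(task: str, max_steps: int = 10) -> List[Step]:
    """Run reason-then-act steps until the model signals 'finish'."""
    history: List[Step] = []
    for _ in range(max_steps):
        thought, action, argument = fake_llm(task, history)
        if action == "finish":
            history.append(Step(thought, action, argument, observation=""))
            break
        observation = TOOLS[action](argument)  # act, then record the result
        history.append(Step(thought, action, argument, observation))
    return history


if __name__ == "__main__":
    for step in agent_loop("make the tests pass"):
        print(f"{step.thought} -> {step.action}({step.argument}) {step.observation}")
```

The `max_steps` cap is a design choice of this sketch: it keeps the loop from running forever if the model never decides that the request has been answered.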
The Minds Mastering Machines conference will take place in Cologne on April 24 and 25, 2024. Organized by iX and dpunkt.verlag, the specialist conference looks beyond the AI hype and is aimed at data scientists, data engineers and developers who turn machine learning projects into reality.
The conference program offers a good 30 lectures in three tracks over two days. Lars Röwekamp, the author of this article, gives a lecture on the balancing act between bias and fairness.
On the second day of the conference there will also be a panel discussion on the effects of the AI Act.
The memory the agents need can be provided by a RAG system (Retrieval-Augmented Generation). Depending on the task, the code or information fragment needed for the next step is selected from the RAG system and passed on to the LLM. This trick keeps the fragments small and thereby works around the limits of the context size. That in turn enables the system to complete complex tasks that can consist of several thousand steps and a corresponding number of decisions and actions.
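The retrieval step can be illustrated with a small sketch. It assumes a toy in-memory store and a simple bag-of-words cosine similarity instead of the learned embeddings and vector database a production RAG system would typically use; `FragmentStore`, `retrieve` and the `max_chars` budget are hypothetical names, not part of any published Devin interface.

```python
import math
import re
from collections import Counter
from typing import List


def tokenize(text: str) -> Counter:
    """Lower-case bag of words; a stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9_]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class FragmentStore:
    """Toy memory (assumption): stores text fragments, returns the most relevant."""

    def __init__(self) -> None:
        self.fragments: List[str] = []

    def add(self, fragment: str) -> None:
        self.fragments.append(fragment)

    def retrieve(self, query: str, k: int = 2, max_chars: int = 200) -> List[str]:
        """Return up to k best-matching fragments that fit the size budget."""
        q = tokenize(query)
        ranked = sorted(self.fragments,
                        key=lambda f: cosine(q, tokenize(f)),
                        reverse=True)
        selected: List[str] = []
        used = 0
        for fragment in ranked[:k]:
            if used + len(fragment) > max_chars:
                continue  # skip fragments that would exceed the context budget
            selected.append(fragment)
            used += len(fragment)
        return selected


if __name__ == "__main__":
    store = FragmentStore()
    store.add("def parse_config(path): load yaml settings from path")
    store.add("benchmark results: provider A latency 120 ms")
    store.add("def send_request(provider, prompt): call the provider API")
    print(store.retrieve("fix the parse_config function that loads settings", k=1))
```

Only the selected fragments would be placed in the prompt for the next step, which is how the context stays small even when the overall task history grows large.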
By the way, it is interesting that the prompting UI of the Devin Workbench is not only used to set the initial task, but also for subsequent human-machine interaction during processing. With the help of this “human in the loop” agent, it is possible, for example, to give Devin feedback on decisions made and the resulting actions, to dynamically adjust the priorities of the tasks or to revise the task list as necessary.
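How such feedback could act on a task list is sketched below. This is an illustration under assumptions, not the Devin Workbench's actual mechanism: the `Task` and `TaskList` classes and the three feedback commands (`prioritize`, `drop`, `add`) are invented for this example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Task:
    name: str
    priority: int  # lower number means "do this earlier"


class TaskList:
    """Hypothetical plan that a human can adjust while the agent is working."""

    def __init__(self, tasks: List[Task]) -> None:
        self.tasks = list(tasks)

    def next_task(self) -> Optional[Task]:
        """Return the pending task with the highest priority, if any."""
        return min(self.tasks, key=lambda t: t.priority) if self.tasks else None

    def apply_feedback(self, command: str) -> None:
        """Apply a human instruction of the form '<verb> <task name>'."""
        verb, _, name = command.partition(" ")
        if verb == "prioritize":
            lowest = min(t.priority for t in self.tasks)
            for task in self.tasks:
                if task.name == name:
                    task.priority = lowest - 1  # move ahead of everything else
        elif verb == "drop":
            self.tasks = [t for t in self.tasks if t.name != name]
        elif verb == "add":
            highest = max((t.priority for t in self.tasks), default=0)
            self.tasks.append(Task(name, highest + 1))  # append at the end
        else:
            raise ValueError(f"unknown feedback command: {verb!r}")


if __name__ == "__main__":
    plan = TaskList([
        Task("set up environment", 1),
        Task("write benchmark script", 2),
        Task("plot results", 3),
    ])
    plan.apply_feedback("prioritize plot results")
    plan.apply_feedback("drop write benchmark script")
    plan.apply_feedback("add write README")
    print([t.name for t in sorted(plan.tasks, key=lambda t: t.priority)])
```

In an agent loop, `apply_feedback` would be called between steps, so that the next call to `next_task` already reflects the human's correction.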
From the perspective of the AI methods used, Devin doesn't really offer anything new, but simply combines what already exists in a clever way. This is in no way intended to diminish the achievement of the undisputedly gifted team at Cognition. After all, its members have won a total of ten gold medals at the notoriously difficult International Olympiad in Informatics.
Realistically speaking, despite all the euphoria that has arisen over the past two weeks and the prophecy that the age of Artificial General Intelligence (AGI) has finally dawned, Devin is reaching its limits.