AI Week in Review 26.02.28

Figure 1. Google’s Nano Banana 2 is not as censored as it once was, and it can produce realistic images in a variety of formats at high resolution, like this not-real Tucker Carlson interview of Ayatollah Khamenei.

When the OpenClaw autonomous agent went viral last month, it was a prototype of the future. Now the future is here, with an explosion of new AI agent releases and features, including some of the same features that made OpenClaw compelling.

Perplexity introduced Perplexity Computer, an autonomous AI task system capable of executing extended workflows, scheduled tasks, and long-duration projects. The system performs sequential tasks within persistent compute environments. By orchestrating agent tasks across 19 specialized frontier models in parallel, it acts as a digital coworker that can break broad project descriptions into isolated sub-tasks for research, coding, and deployment. It is launching for Max users first, and live examples are here.

Cursor launched cloud-based autonomous Agents that run in isolated cloud-hosted VMs and deliver video demos of completed code changes. Agents can onboard into a codebase on their own, complete tasks, and visually display results, which expands the scope of agentic AI from local to managed cloud environments:

“Agents can onboard to your codebase, use a cloud computer to make changes, and send you a video demo of their finished work.”

Nous Research launched Hermes Agent, an open-source AI agent system featuring a multi-level memory structure and dedicated support for remote terminal access. Powered by the Hermes-3 AI model, Hermes Agent is a research-focused autonomous agent designed to assist with investigative and exploratory AI workflows.
It keeps memory across sessions by retaining its context within your local environment, and it communicates through familiar integrations like Slack and Telegram.

Anthropic continues to rapidly add incremental features to Claude Code and Cowork: Claude Cowork now has scheduled tasks, a “cron job” feature that runs recurring tasks automatically. Claude Code added an auto-memory system that maintains context across sessions. Anthropic also added a remote control feature to Claude Code that allows users to manage coding sessions from mobile devices, boosting productivity by enabling control outside a primary environment.

Cognition released Devin 2.2, “one of our biggest updates to Devin since launch.” It is 3 times faster, with a new interface and several autonomy upgrades:

It’s now able to test its work with computer use, self-verify, and auto-fix its code.

Figure 2. Devin 2.2 computer use with split pane terminals running in a sandbox, for testing and running generated code.

Bottom-line: AI labs are rapidly iterating and improving their AI agents, adopting features from other AI agents.

Google launched Nano Banana 2, also known as Gemini 3.1 flash-image preview, which delivers high-quality text-to-visual output on par with Nano Banana Pro but with better speed and efficiency. It has new image aspect ratios and delivers native 4K image synthesis in under 500ms. The model also integrates image search capabilities to reference external content during generation. It is accessible through the Gemini platform and app.

Figure 3.
Nano Banana 2 can blend a large number of elements and image characters into a scene.

Alibaba unveiled additional models in the Qwen 3.5 lineup that show remarkably strong performance, on par with Sonnet 4.5, in efficient smaller models:

Qwen3.5-122B-A10B: A 122B model with only 10B active parameters that outperforms the Qwen 3 235B parameter flagship on long-context tasks.

Qwen3.5-27B: A dense AI model suitable for use on local devices that supports 800K context length.

Qwen3.5-35B-A3B: A hybrid-architecture MoE with only 3B active parameters that rivals much larger dense models in reasoning and coding, balancing high performance with low computational overhead for local deployment.

LM Studio debuted LMLink, enabling secure streaming of locally hosted AI model inference to other devices. This tool allows users to run models on personal hardware while accessing them remotely from other devices. It facilitates private, off-network AI deployment.

Liquid AI announced LFM2-24B-A2B, a 24B parameter MoE model that operates with 2.3B active parameters, enabling it to run on consumer-grade laptops. This is their highest performance AI model yet, but they don’t provide benchmarks against non-LFM models, making it hard to assess competitiveness. It is available on HuggingFace.

Perplexity released pplx-embed, a suite of state-of-the-art multilingual embedding models built on the Qwen3 architecture for web-scale retrieval tasks. These models are designed to produce high-quality vector representations of text for search, retrieval, and semantic applications. Utilizing bidirectional attention and diffusion-based pretraining, the models extract clean semantic signals from noisy data and are highly optimized for Retrieval-Augmented Generation (RAG) pipelines.

OpenAI announced it would discontinue using SWE-bench Verified performance evaluations due to divergence from observed real-world capabilities and instead adopt SWE-bench Pro for future assessments.
They found that SWE-bench Verified has benchmark flaws, and that data contamination and “bench-maxxing” undermine the metric’s utility. The newer SWE-bench Pro tracks real-world model performance more closely.

Confluence Labs announced a state-of-the-art score of 97.9% on the public ARC-AGI-2 benchmark. The ARC-AGI-2 benchmark is a suite of tasks designed to evaluate how well an AI system can learn and generalize with reasoning. The result puts the company at the top of this notoriously difficult evaluation, suggesting that its approach delivers exceptional performance.

In a similar advance, Agentica claimed to have solved all publicly available “hard for AI” tasks of the ARC-AGI-3 benchmark. Solving 3 of the games in this benchmark indicates the platform is capable of high-level autonomous reasoning and of solving complex problems. If verified, this milestone would represent progress toward AGI and generalizable AI systems.

Anthropic’s Claude Opus 4.6 reportedly achieved the capacity to complete tasks that take humans up to 14.6 hours to finish, at ~50% success, on the METR time-horizon benchmark. Opus 4.6’s performance suggests accelerating advances in long-horizon tasks for AI models. It is also hitting the upper bounds of the METR benchmark’s utility, as there are few long-range tasks in the metric.

Bottom-line: AI models are saturating all the benchmarks.

Figure 4. The METR benchmark shows remarkable progress; the task length that an AI model can perform is doubling every 7 months, and the pace is accelerating.

Sakana AI introduced Doc-to-LoRA and Text-to-LoRA, two hypernetwork methods that instantly adapt LLMs using zero-shot natural language descriptions.
The architecture significantly reduces KV-cache memory usage by internalizing document context directly into the model weights, maintaining high accuracy on long-context information retrieval tasks at a fraction of the cost.

OpenAI and Amazon revealed a major multi-year strategic partnership to bring OpenAI’s enterprise AI platform to AWS infrastructure, including stateful computing environments and custom models tailored for enterprise workloads. Amazon is investing an initial $15 billion (with a commitment up to $50 billion) to support joint AI innovation and deployment at scale.

URL: https://openai.com/index/amazon-partnership/

Anthropic claimed that rival AI labs from China launched large-scale distillation attacks using thousands of fabricated accounts to extract Claude model intellectual property from over 16 million interactions. The distillations (from DeepSeek, Moonshot, and Minimax) harvested output data from Claude to train competing AI models. Anthropic’s claims, while likely true, faced community blowback because Anthropic has also trained its own AI models on the intellectual property of others.

Earlier this week, the U.S. Department of War issued an ultimatum to Anthropic, demanding the removal of usage restrictions on its Claude AI models for military applications or else risk being designated a “supply chain risk” under potential Defense Production Act enforcement. At issue were guardrails that Claude AI models had in place restricting their use.

Anthropic publicly declined to remove restrictions, specifically those preventing use of Claude for domestic surveillance and autonomous lethal weapon decisions. Claude has been deployed in classified government networks and by the military, while maintaining strict operational guardrails against these two specific applications.

In response, the Trump administration acted on its threat and ordered federal agencies to stop using Anthropic’s Claude models over the AI safety dispute.
Anthropic CEO Dario Amodei criticized the Pentagon’s recent designation of the company as a “supply chain risk,” framing the move to restrict military contractor access to Claude as retaliatory. He wasn’t wrong; it was.

At the same time, OpenAI secured a deal with Trump’s War Department to supply ChatGPT AI models. In an X post, Sam Altman stated:

“We also will build technical safeguards to ensure our models behave as they should, which the DoW also wanted. We will deploy FDEs to help with our models and to ensure their safety, we will deploy on cloud networks only.

We are asking the DoW to offer these same terms to all AI companies, which in our opinion we think everyone should be willing to accept. We have expressed our strong desire to see things de-escalate away from legal and governmental actions and towards reasonable agreements.”

To add to the confusion, the post received a community note saying the US Government will use OpenAI’s models for “all lawful purposes,” allowing for uses Anthropic insisted be explicitly barred. Did Sam Altman pull a fast one?

Advanced Micro Devices (AMD) secured a massive multi-year supply agreement to provide Meta Platforms with next-generation AI infrastructure. Meta will deploy AMD Instinct GPUs at large scale starting in late 2026 to diversify its chip sourcing, establishing AMD as a major challenger in the AI hardware market.

Jack Dorsey, CEO of Block, the parent of Square, announced layoffs of 4,000 employees, attributing the reductions to AI-driven automation that has reduced labor needs. However, some commentary suggests that it was simply a correction of prior over-staffing. AI will be both a real cause of job losses and a scapegoat for corporate cost-cutting going forward.

Keying off the debate between Anthropic and Trump’s Department of War, Gary Marcus suggests a “Code Red for Humanity” over the severe risks associated with trusting generative AI in life-or-death military scenarios.
For example, researchers observed that the tested models opted for nuclear escalation in 95% of simulated crises, highlighting the dangers of deploying unreliable systems without stringent human oversight.

As AI becomes a tool of war, skepticism and concern are warranted. Anthropic did the right thing by sticking to its guardrails and principles. But that may not matter if there is a literal AI arms race, in which AI gets used in dangerous ways because you fear an enemy might gain an advantage if you don’t.
