AI Week in Review 26.04.25

Figure 1. GPT Image 2.0 is a stellar image model that creates super-high-resolution images, generating this movie-poster rendition of this week’s theme and releases.

OpenAI released GPT-5.5 and the even more powerful GPT-5.5 Pro, its most capable frontier models yet. GPT-5.5 and GPT-5.5 Pro, called the “Spud” model internally during training, take long-horizon agentic AI to new heights, with improvements in long-context reasoning, coding, spreadsheet and document work, computer-use tasks, and scientific research workflows.

GPT-5.5 achieved SOTA results on several real-world benchmarks: 82.7% on Terminal-Bench 2.0 and 84.9% on GDPval. It scores 58.6% on SWE-Bench Pro, below Opus 4.7’s 64.3%, but reviewers have noted that this comparison overlooks efficiency gains and tokenizer differences that make GPT-5.5 faster and more consistent, helping it perform at the highest level in production use cases.

Also impressively, GPT-5.5 achieves this performance with significantly fewer output tokens than GPT-5.4, partly offsetting its higher per-token cost. GPT-5.5 comes with a 1 million-token context window and advanced compaction for long-horizon tasks, with some testers reporting that GPT-5.5 worked on complex code tasks autonomously for 7+ hours without stopping.

Figure 2.
GPT-5.5 shows more intelligence on a per-token basis than Claude Opus 4.7 or other frontier AI models on the Artificial Analysis Intelligence Index, a weighted average of 10 benchmarks: AA-LCR, AA-Omniscience, CritPt, GDPval-AA, GPQA Diamond, Humanity’s Last Exam, IFBench, SciCode, Terminal-Bench Hard, and τ²-Bench Telecom.

Additionally, OpenAI is introducing a Bio Bug Bounty program, inviting researchers to expose biological weapons vulnerabilities in GPT-5.5, with monetary rewards for bugs found.

Underscoring its push to win enterprise users, OpenAI launched Workspace Agents in ChatGPT, a Codex-powered platform for automating business workflows such as software triage, lead outreach, and metrics reporting. These shared, cloud-based agents feature persistent memory and scheduled tasks, and they integrate with third-party applications such as Slack, Google Drive, and Salesforce. The system includes pre-built agents such as Tally for report generation and Scout for customer feedback management, while also allowing companies to build custom agents.

OpenAI also introduced ChatGPT for Clinicians, a version of ChatGPT built for clinical use and free to verified U.S. physicians, NPs, PAs, and pharmacists, to assist with clinical documentation and medical research. The announcement coincides with the release of HealthBench Professional, an open benchmark for evaluating AI performance in clinical chat tasks such as care consults and medical research.

Figure 3. Conversation in ChatGPT for Clinicians, discussing a patient medical issue.

Finally, OpenAI launched GPT Image 2.0, a state-of-the-art AI image generation model that combines thinking and web search with image generation to outperform existing models. The model demonstrates a high level of precision in 3D modeling and text rendering, enabling it to generate functional blueprints, complex UI designs, and even legible text on small objects like a grain of rice.
The thinking mode allows the model to research, plan, and reason through image generation using an agentic approach.

Figure 4. Example output from GPT Image 2.0 shared by OpenAI, a collage showing detailed and accurate text rendering in multiple languages and the variety of outputs the model can generate.

GPT Image 2.0 can generate up to eight consistent images from a single prompt. Edits are great too. I got the model to colorize a black-and-white movie still and restyle a low-res cartoon graphic into full-color, cinematic, photo-realistic scenes. Combining GPT-5.5 with GPT Image 2.0 adds even more power, enabling automated asset creation followed by graphical interface development or presentation generation and boosting what the models can do for visual workflows.

OpenAI was aiming for a work-oriented SOTA AI model to retake bragging rights from Anthropic’s Claude, and GPT-5.5 along with GPT Image 2.0 hits the mark.

Figure 5. Example output from GPT Image 2.0, showing photo-realism, faithful text rendering, and high resolution.

DeepSeek released V4-Pro and V4-Flash preview models. The long-awaited DeepSeek V4 Pro and Flash are near-frontier AI models, with SWE-Bench Pro scores of 55.4% for Pro and 52.6% for Flash. DeepSeek V4 has a 1 million-token context window, and it uses optimizations to cut KV-cache sizes by 10x, making long-horizon tasks faster and more efficient. More detail on the model optimizations is in the DeepSeek V4 Technical Report.

The efficiency of DeepSeek V4 makes it the price-performance leader. DeepSeek V4-Flash costs only $0.14 per 1M input tokens and $0.28 per 1M output tokens, giving users Gemini 3.1 Pro-level capabilities at a Flash-lite price point. The models are open source and available on HuggingFace or via third-party APIs, and their cost-effective near-SOTA performance makes them a good fit for AI agents like OpenClaw.

Figure 6. DeepSeek V4 Pro-Max achieves benchmark scores close to Claude Opus 4.6.
DeepSeek has also developed KV cache optimizations to cut KV cache sizes by 10x relative to DeepSeek V3.1.

Moonshot AI released Kimi K2.6, an advanced open-source AI model for coding and long-horizon agentic tasks. It competes with leading proprietary AI models, scoring 58.6% on SWE-Bench Pro and 83.2% on BrowseComp. The model features significant upgrades, including the ability to handle coding sessions of 12 hours or more and an improved Kimi Agent Swarm feature that manages over 300 parallel agents for complex workflows.

Alibaba launched a preview of its flagship Qwen 3.6 Max model, designed to function as a consistent autonomous agent for long-horizon practical tasks. The model achieves superior instruction following and improved real-world reasoning compared to the previous Qwen 3.6 Plus. Qwen 3.6 Max Preview has solid AI coding capabilities (57.3% on SWE-Bench Pro), and its overall benchmarks are comparable to Claude Opus 4.5, making it near-frontier if not SOTA. While the preview is currently not open source, it is available via API or Qwen chat.

xAI introduced grok-voice-think-fast-1.0, its new flagship voice model, which supports over 25 languages and holds the top position on the Tau-voice Bench leaderboard. It excels at complex, multi-step workflows and high-volume tool calling with low response latency.

Google announced new AI-driven updates to Workspace at Google Cloud Next. It introduced Workspace Intelligence, an AI system that automates tasks using data from Gmail, Calendar, Chat, and Drive. A new Gemini integration, “Gemini in Sheets,” will enable prompt-based spreadsheet construction in Google Sheets, and Gemini will also offer automated text generation and refinement in Google Docs.

At Cloud Next, Google also announced ‘auto browse’ agentic capabilities for Chrome. The new feature uses Gemini to automate tasks such as booking travel and inputting data by understanding the live context of open browser tabs.
Google is also introducing “Shadow IT risk detection” to identify unsanctioned AI tools and is expanding its security partnership with Okta.

X announced the launch of Grok-powered Custom Timelines. The feature uses Grok’s AI to build and personalize curated feeds for over 75 specific topics that can be pinned to the home tab. This rollout coincides with the shutdown of X Communities and is currently available to Premium X subscribers on iOS.

Google AI Edge Eloquent is a new live AI transcription app. The app requires no subscription, has no usage limits, and filters out filler words like “um.” It is currently available on iOS, but Google plans to bring the app to Android and macOS.

Microsoft is rolling out a new Copilot Agent Mode inside Office apps like Word, Excel, and PowerPoint this week. This feature allows Copilot to better follow complex instructions, execute multi-step edits, and show real-time progress via a sidebar. This mode is being released as the default experience for Microsoft 365 Copilot and Premium subscribers, as well as Personal and Family plans.

Google released Deep Research and Deep Research Max agents. The new agents allow developers to fuse open web data with proprietary enterprise information via the Model Context Protocol and produce native charts and infographics. Built on the Gemini 3.1 Pro model, the release features a tiered architecture optimized for either low-latency interactivity or intensive, asynchronous reasoning.

Anthropic introduced Claude Cowork Live Artifacts, a feature that allows users to create mini web apps and dashboards that update in real time. Live artifacts automatically pull fresh data from connected applications like Gmail, Google Calendar, and Bitly. These live dashboards can act as a “daily command center,” categorizing urgent emails and tracking link performance without wasting Claude tokens on regeneration.

New startup Noscroll launched an AI-powered bot to replace doomscrolling.
The bot service allows users to track specific interests through natural language interaction, then monitors social feeds, news sites, and other online sources to send personalized news digests via text.

OpenAI has released a research preview of Chronicle, a feature that builds a memory of a user’s day-to-day work from screen snapshots to become more context-aware and helpful over time. Similar to Microsoft’s Recall, Chronicle is currently integrated into the Codex environment for Mac users only. Although it is token-heavy, internal testers report that Chronicle assists their daily workflows by leveraging historical work context. However, the feature also raises serious privacy questions.

Stanford researchers demonstrated that single-agent systems match or outperform multi-agent architectures on complex reasoning tasks when given the same thinking token budget. The paper “Single-Agent LLMs Outperform Multi-Agent Systems on Multi-Hop Reasoning Under Equal Thinking Token Budgets” shows that single-agent systems are more information-efficient, and that multi-agent systems only become competitive when a single agent’s effective context utilization is degraded or when more compute is expended.

Brown University researchers reported evidence that language models can develop a basic mathematical grasp of real-world plausibility. The study found that models can distinguish commonplace, improbable, impossible, and nonsensical events at a basic level, adding nuance to the debate over whether AI systems merely mimic language or build internal representations of the world.

A new safety study found that stylized prompts can bypass LLM guardrails.
Researchers tested “adversarial humanities” prompts that wrap dangerous requests in literary or theological styles, and compliance rates rose dramatically versus plain harmful prompts, raising concerns for agentic systems and safety testing.

SpaceX has entered into a partnership with Cursor, securing an option to buy Cursor for $60 billion later this year. Under the deal, which includes a $10 billion investment by SpaceX, AI coding firm Cursor will gain access to the Colossus supercomputer cluster of xAI (now part of SpaceX) to assist in training its specialized AI coding models.

Anthropic is investigating unauthorized access to its Claude Mythos Preview model. A private online community reportedly gained access to Mythos after attackers used information from a third-party vendor breach at Mercor to guess the model’s location. The breach relied on an educated guess rather than a sophisticated technical exploit, but the model’s ability to automate complex cyberattacks presents security risks.

German central bank chief Joachim Nagel warned that Anthropic’s Mythos could create cybersecurity and market-access risks. He called for broader institutional access and stronger safeguards, warning that advanced coding and vulnerability-discovery capabilities could be misused in finance and other critical sectors.

Anthropic recently removed “Claude Code” from its Pro subscription, then clarified that the change applied to a small percentage of new sign-ups as part of a testing phase.
The change was intended to manage heavy compute demands due to long-running agents and high-capacity chat features.

The AI free ride is over: Anthropic has begun restricting third-party AI agent tools like OpenClaw to manage system strain and pursue profitability, and the changes have led to a backlash from users.

Boehringer Ingelheim opened an AI and machine-learning center for pharmaceutical R&D in London, with the goal of strengthening the company’s use of AI across drug development, including trial recruitment, site selection, and regulatory workflows.

Meta is reportedly tracking employee computer activity to train AI agents. Meta’s “Model Capability Initiative” records work-related mouse movements, clicks, keystrokes, and occasional screenshots from U.S.-based employees, creating a new internal-data pipeline for training workplace agents.

A U.S. appeals court said lawyers should disclose when AI causes legal errors. The decision adds to pressure on attorneys and courts to manage AI hallucinations that lead to legal filings with false citations or other AI-generated mistakes.

Google said 75% of its new code is AI-generated, up from 50% last fall. Google CEO Sundar Pichai announced the formation of a specialized unit to further enhance AI coding performance and use. The initiative seeks to catch up to Anthropic, which uses Claude Code to develop up to 90% of its code.
