OpenAI released GPT-5.4 Thinking and GPT-5.4 Pro, flagship frontier AI models designed for advanced reasoning, coding, and agentic tasks. The models include a 1-million-token context window and native “computer use” capabilities, allowing the AI to interact directly with a desktop environment through screenshots, mouse clicks, and keyboard commands, not just text input-output. The release also includes new features in ChatGPT including a new /fast mode for lower-latency responses, mid-response interruption to steer outputs in real time, and tool search, which allows models to work efficiently when given many tools.GPT-5.4 Thinking and Pro achieve record-breaking performance on evaluations such as GDPval (83%), FrontierMath (47.6%), SWE-Bench Pro (57.7%), OSWorld (75%) and BrowseComp (82.7%). These benchmarks in some cases exceed human-level performance. GPT-5.4’s computer-use features and performance make the model excellent for both general-purpose and agentic tasks.It’s also proven to be powerful at extended reasoning. During early testing, a mathematician reported that GPT-5.4 successfully solved a research-level FrontierMath problem that had remained unsolved for about twenty years.Figure 2. OpenAI is SOTA on coding benchmark SWE-Bench Pro, computer-use related benchmarks such as BrowseComp, and virtual work benchmark GDPval. It’s designed for best-in-class performance on various AI agent workflows.Marketed as “designed for professional work,” OpenAI added deeper Excel integration and a new suite of financial-analysis tools to make GPT-5.4 even more useful for virtual work tasks. On an internal investment-banking benchmark covering workflows like financial modeling and research, GPT 5.4 scored 87%.Real-world reviews of GPT-5.4 are quite positive for agentic workloads and coding, but it’s far from perfect, getting the carwash logic question wrong and being slow and weaker at writing than competing AI models. It’s also expensive: GPT‑5.4 API costs $2.50 in and $15 out per million tokens, higher than previous versions, while GPT‑5.4 Pro costs an impractical $30 / $180 per million input/output tokens.OpenAI released GPT-5.3 Instant, an update to its lightweight version of GPT-5 to improve conversational flow, relevance, and tone to improve the daily user experience in ChatGPT. OpenAI called it “more contextual” and less “cringe,” with more direct answers and less caveats, refusals, and moralizing preambles. Optimized for speed and cost efficiency in conversational use cases, this model will be deployed widely for free-tier users as the default assistant model.Google launched Gemini 3.1 Flash Lite, high-speed (282 tokens per second) model offering cost-efficient performance. Gemini 3.1 Flash Lite includes adjustable reasoning levels (minimal to high), code execution, and grounding with Google Search, and it outperforms Gemini 2.5 Flash and Claude 4.5 Haiku on several benchmarks (1432 ELO on LM Arena), with particular strengths in multimodal reasoning (76.8% on MMMU). Its high-speed performance and lower API pricing of $0.25 / $1.50 per million input / output token make it a useful choice for high-throughput AI applications, such as automated image description and summarization.Alibaba Qwen has added to their Qwen 3.5 lineup, releasing the Qwen 3.5 Small Model Series of 0.8B, 2B, 4B, and 9B parameter multimodal AI models aimed at efficient performance and edge device deployment. Based on native multi-modal Qwen3.5 foundation, these models have early-fusion multimodal architecture and training, support for tool use, and up to 262K tokens of context.This lightweight open Qwen 3.5 models have excellent performance for their size in vision, reasoning, coding, and agent uses. They are available via GitHub and Hugging Face.Figure 3. The Qwen3.5 9B and Qwen3.5-4B models have excellent performance for their size on multimodal tasks.StepFun released several Step 3.5 AI models built around a sparse Mixture-of-Experts architecture: Step-3.5-Flash-Base and Step-3.5-Flash-Base-Midtrain, open (Apache 2.0) base models with 256K context window, and 196.8B total parameters with 11B active parameters. The mid-train checkpoint is positioned for stronger code and agent workflows, extending the release beyond just inference weights. The AI model releases were accompanied by a release of the open SteptronOSS training framework. This openness supports developers wanting to fine-tune from the Step 3.5 base foundation AI model.Chinese AI lab Yuan AI launched Yuan 3.0 Ultra, an open frontier multimodal AI model built on a Mixture-of-Experts (MoE) architecture and designed for enterprise workflows. Yuan 3.0 Ultra has 1T total parameters with 68.8B activated, supports multimodal inputs, and curbs AI “overthinking” with RIRM (Reflective Inhibition Reward Mechanism). The open source model is free for commercial use, with model access through its site and Hugging Face.xAI updated its Grok 4.20 to version beta 2. This release includes improvements to instruction following, reduced hallucinations, enhanced precision in scientific text processing, and improved reliability of image searches.Cognition, the makers of Devin coding agent, published an early preview of SWE-1.6, an updated coding model in its SWE family focused on software engineering tasks. SWE-1.6 is post-trained on the same base as SWE-1.5, and scores 11% higher than SWE-1.5 on SWE-Bench Pro in the company’s reported evaluation. The research update is a preview of a strong agentic coding model that could be quite fast; it runs at 950 tokens per second.Figure 4. Cognition’s coding-specialized AI model SWE-1.6 is comparable to Claude Opus 4.5 on SWE-Bench Pro.OpenAI’s Codex desktop app was launched on Windows, extending its agentic coding app beyond macOS. The Codex desktop app manages software development workflows powered by OpenAI’s coding agents and can handle multiple long-running coding tasks simultaneously. The Windows Codex release runs both natively and in WSL, with terminal support for PowerShell, Command Prompt, Git Bash, or WSL, and comes with a Windows-native secure agent sandbox for deploying code safely.Google released the Google Workspace CLI, a command-line interface that connects AI agents and developers to Google Workspace services such as Google Drive, Gmail, Calendar. As VentureBeat explains, Google Workspace CLI provides more structured, reusable, and token-efficient access to Workspace services for AI agents. It ships with over 100 Agent Skills, plus higher-level helpers and curated recipes for common workflows. This Google release is similar to third-party utilities like Gog, used in OpenClaw, and it validates CLIs as an efficient AI tool integration mechanism.Google added a Cinematic Video Overviews feature to NotebookLM that creates animated motion graphics and realistic AI-generated video content based on uploaded documents. The feature is powered by Gemini3, Nano Banana Pro, and Veo3 and is available to Google AI Ultra subscribers. NotebookLM also added 10 custom infographic styles users can choose from.Google made Canvas in AI mode search available for all users in the United States. The Canvas interface allows users to view and interact with generated code or visual outputs in a window, matching similar workspace interfaces in Claude and ChatGPT.Anthropic released a data migration tool allowing users to import their chat history and preferences from other AI providers. This feature aims to reduce switching costs for users moving from platforms like ChatGPT to Claude. Additionally, Anthropic expanded its “memory” features to include users on its free tier.OpenAI released Symphony, an orchestration framework for managing autonomous coding work released as an open-source (Apache-2.0 license) GitHub project. The repository describes Symphony as a system that turns project work into “isolated, autonomous implementation runs” so teams can manage work instead of directly supervising coding agents:Symphony works best in codebases that have adopted harness engineering. Symphony is the next step -- moving from managing coding agents to managing work that needs to get done.Microsoft Research released Phi-4-reasoning-vision-15B, a compact 15B parameter open-weight multimodal reasoning model. It is designed to balance computational efficiency with high-level performance in math, science, and user interface understanding. They shared their training and architecture learnings on the model in “Phi-4-reasoning-vision-15B Technical Report” and open weights on HuggingFace for local model use.Anthropic published “Labor market impacts of AI: A new measure and early evidence,” a labor market report that introduces a method for estimating how AI is affecting hiring and work. Anthropic’s analysis combines theoretical task exposure with observed work-related Claude usage to measure labor-market effects. The report says there is not yet evidence of economy-wide displacement, but it does find that jobs with higher AI exposure are seeing weaker hiring trends, especially for earlier-career roles.Anthropic has been designated a “supply chain risk to America’s national security” by the Department of War, and Anthropic plans to challenge the action in court. The company said the designation followed a dispute over whether its models could be used without Anthropic’s existing safeguards for certain national-security uses. Anthropic also said the scope is narrower than a blanket ban and is tied to the department’s contracting requirements.Meta is creating a new Applied AI Engineering organization, bridging gaps among tooling, infrastructure, and model-development to help turn research advances into deployable systems. The group has a notably flat management structure and will work closely with Meta’s superintelligence team.U.S. officials are considering limiting Chinese customers to 75,000 Nvidia H200 chips each, with AMD’s MI325 chips counted under the same rule. The proposal would also maintain an aggregate cap of one million such accelerators into China, sharply constraining the size of AI-training clusters Chinese firms could assemble.Apple announced new MacBook Air and MacBook Pro systems built around the M5 family, offering strongest on-device AI performance, claiming 4x AI performance over the prior generation. Apple said the M5 MacBook Air includes “a Neural Accelerator in each core,” while the MacBook Pro with M5 Pro and M5 Max adds “Neural Accelerators in the GPU” for running advanced AI workloads locally.Amazon is exploring technology that would let other apps and websites sell ads inside chatbot interfaces. Amazon has discussed the idea with companies including ad-tech and publishing partners, which would extend its advertising business into AI-native conversational surfaces.me stepping down. bye my beloved qwen. - Junyang LinQwen lead Junyang Lin abruptly announced his departure from Qwen team in a public social media post this week. The situation reportedly was due to team unhappiness in the Qwen team over lack of resource support; it escalated enough that senior Alibaba leadership intervened to address the dispute, but not satisfactorily.The announcement triggered widespread AI community discussion about the future of the Qwen team and its models. Junyang Lin has been described as the linchpin of the Qwen team, a highly productive team that has released over 400 models in 3 years. Xingyao Zhang (Leo) on X expressed it well:Shipped Qwen 3.5 small models that run on a phone with 1GB RAM, got Elon to call it ‘impressive intelligence density,’ and walked out. That’s how you exit. Whatever’s next, good luck.The best AI leaders seem to make a positive impact no matter where they land, so we wish Junyang Lin all the best wherever he takes his AI leadership.
AI Week in Review 26.03.07
GPT-5.4 Pro, GPT-5.3 Instant, Gemini 3.1 Flash Lite, Qwen 3.5 - 0.8B 2B 4B 9B, Step 3.5 Flash Base, Yuan 3.0 Ultra, Grok 4.20 beta 2, Cognition SWE-1.6, Codex desktop for Win, Phi4-vision-reasoning.







