
7 posts tagged with "insights"


Taking the Pulse of Agentic AI from the Developer Community at the End of Q1 2026

· 14 min read
Xia Xiaoya
Senior Researcher

Today, I want to share some observations on the Agentic AI ecosystem from the vantage point of 2026's first quarter—technical trends read from popular projects, portraits of AI developers, and the subtle relationship between developers and AI tools. This is not meant to be comprehensive; we welcome the community to share more observations and reflections.


Agentic Ecosystem in 2026

This year, everyone seems to be in a state where FOMO and excitement intertwine. There's a sense that AI application deployment has reached an unprecedented acceleration point—perhaps even a tipping point. But is this tipping point real or emotionally amplified? Let's calibrate our intuition with two metrics.

This chart shows the top 20 projects by OpenRank last month and the top 20 by Star growth this year—the most active and most-watched projects. I've highlighted LLM-related projects, and unsurprisingly, OpenClaw occupies the #1 and #2 spots on both lists.

Developer attention has completely flowed toward the Agent ecosystem, although the Star count list includes many awesome-collection type projects (which naturally attract more attention). Just looking at the project names, you can feel they're permutations of a few words: OpenClaw, Skills, Claude, Claude Skills, OpenClaw Skills. The actual developer effort reflected in activity metrics is somewhat more honest, but even so, LLM-related projects account for about 40%.

Expanding the scope to the top 1000 most-watched repositories, after rough labeling, we can see 81% are Agent-related. The most frequently tagged keywords in project Topics are: Agent, Claude, LLM, Code, Skill.

Looking back over the past few years, you can feel the rotation of technological ecosystem dominance from the naming of popular projects emerging at different stages. Popular projects created around 2023-2024 were mostly related to GPT and Llama, such as AutoGPT, MetaGPT, Ollama, llama.cpp. As time turns, there are always technologies that serve as unavoidable coordinates. In 2025, that coordinate was called Claude Code, and thus projects like Clawdbot (later OpenClaw) and Claude-Mem emerged.

Based on the currently most popular and active projects, we've compiled the latest map of the Agentic AI ecosystem, covering 50+ projects. Many should look familiar, while some are new faces. Let's follow a few specific projects to examine current technical trends.


From Context Management to Complexity Harness

The optimizations we made under the capability constraints of the foundation models were essentially about managing information in the model's attention window: feeding more effective prompts to the model, invoking tools like browsers, connecting external background knowledge the model needs (RAG), and maintaining memory across multi-turn conversations. This path accumulated into a practice called "Context Engineering."

Claude-Mem and Context7 are two open-source tools created around the middle of last year, each now with tens of thousands of Stars. They found different entry points, but essentially solve the same problem: giving the model more effective background knowledge—and making sure it doesn't forget.

Claude-Mem is a Claude Code plugin that compresses all conversation outputs during Claude Code's task execution using a model, providing them as context for future conversations to ensure the Coding Agent has longer conversation memory.

Context7 provides both MCP service and Skill loading modes. Every time a task is executed, it fetches the latest documentation of involved dependency libraries to ensure the Coding Agent doesn't execute outdated code.
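The pattern both tools share—compress or fetch background knowledge, then prepend it to the next conversation—can be sketched in a few lines. All function names below are illustrative stand-ins, not the actual Claude-Mem or Context7 APIs:

```python
def compress_history(turns, max_chars=200):
    """Stand-in for a model-based summarizer (Claude-Mem style):
    here we simply truncate each turn and keep the tail."""
    summary = [t[:50] for t in turns]
    return " | ".join(summary)[-max_chars:]

def fetch_latest_docs(dependencies):
    """Stand-in for a Context7-style lookup that would fetch the
    current documentation for each dependency library."""
    return {dep: f"<latest docs for {dep}>" for dep in dependencies}

def build_context(turns, dependencies, user_prompt):
    """Assemble memory + fresh docs + the new task into one prompt."""
    memory = compress_history(turns)
    docs = fetch_latest_docs(dependencies)
    doc_block = "\n".join(f"[{d}] {txt}" for d, txt in docs.items())
    return f"MEMORY:\n{memory}\n\nDOCS:\n{doc_block}\n\nTASK:\n{user_prompt}"

prompt = build_context(
    turns=["user: add auth", "agent: done, used jwt"],
    dependencies=["fastapi"],
    user_prompt="now add rate limiting",
)
```

The real tools do the summarization with a model and the doc lookup against live sources, but the assembly step—memory, then docs, then task—is the essence of context management.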

But "Context Engineering" as a term is starting to feel insufficient this year, because the problem is no longer just "is there enough information," but "will the Agent lose control?" Developers have likely experienced this: during autonomous task execution, the Agent either crashes the entire system or stops halfway without saying anything.

Oh-My-OpenAgent (formerly oh-my-opencode, a plugin for OpenCode) calls itself the "strongest Agent Harness" in its project description. It built a continuous execution Enforcer called "Sisyphus": as long as TODO tasks aren't complete, it forces the Agent to keep restarting or finding new paths until 100% achievement—like Sisyphus endlessly pushing the stone up the mountain.
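A harness of this kind is, at its core, a loop that refuses to accept partial completion. A minimal sketch of the idea (hypothetical names, not Oh-My-OpenAgent's actual code):

```python
def sisyphus(todos, run_agent, max_attempts=10):
    """Keep re-running the agent until every TODO is done or the
    attempt budget runs out. `run_agent` takes the remaining TODOs
    and returns the subset it managed to complete."""
    remaining = list(todos)
    attempts = 0
    while remaining and attempts < max_attempts:
        done = run_agent(remaining)
        remaining = [t for t in remaining if t not in done]
        attempts += 1
    return remaining  # empty list == 100% achievement

# A flaky stand-in agent that finishes only one task per invocation:
def flaky_agent(todos):
    return todos[:1]

left = sisyphus(["write tests", "fix lint", "update docs"], flaky_agent)
```

Even this toy version captures the two design decisions a real enforcer must make: how completion is verified, and when to give up (the `max_attempts` budget) rather than push the stone forever.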

So I understand Harness as providing background knowledge while further constraining the Agent's behavioral boundaries—not just letting the Agent know "what is," but making clear "what it can touch" and "what it can't," and knowing what to do when stuck. Context Engineering manages input quality; Harness Engineering manages execution discipline.


Software Development Shifts from Human-Centric to Agent-Centric

This trend can already be felt from the projects above: these newly emerging tools are designed not to serve developers, but with the Agent as the execution subject. Interestingly, what humans have accumulated in software development practices is now migrating to Agents. Developers need to consult the latest documentation—so do Agents; developers need to collaborate in teams—Agents are starting to need that too.

Vibe-Kanban brings traditional task boards to the Agent team collaboration scenario, turning it into the Agent's command center. Each task creates an entry with clear acceptance criteria (AC) on the board. Agents execute against AC, while human engineers do task preview and Diff Review through an integrated UI. This is essentially a Harness too—just constraining not individual Agent execution behavior, but the entire development process.

A fitting analogy: model-driven code generation is a powerful but directionless horse; Harness is the equipment composed of constraints, guardrails, and feedback mechanisms; humans are riders, responsible for giving direction, not running themselves.


The Agent "Evolution" Proposition—Lobsters, Cats, and Bees

Agents are clearly no longer satisfied with fixed process orchestration—self-evolution is the new proposition. OpenClaw started the "raising lobsters" trend first, and soon a new batch of cats and lobsters appeared. These projects, inspired by OpenClaw, each made tradeoffs in different dimensions.

nanoclaw was launched in late January 2026 by indie developer Cohen, built entirely on Anthropic Claude Agent SDK with a core engine of about 4000 lines of code. Its design philosophy is security-first—all Agents run in isolated containers, using Apple Container on macOS and Docker on Linux, with Bash commands running in containers rather than on the host machine. Andrej Karpathy specifically mentioned it on social media: "The codebase is small enough that both I and AI can understand it, so it feels manageable, auditable, and flexible." This sentence precisely captures what this batch of lightweight frameworks is betting on: understandability itself is a security guarantee.

nanobot pushes even further. Built by HKU's Data Intelligence Lab (HKUDS), it is about 4,000 lines of Python—99% less code than OpenClaw. It strips away all non-core modules, keeping only the ReAct reasoning loop, tool calling, and the message queue. Subsequent versions even removed the litellm external dependency, switching to native SDKs for direct model connections—the shorter the supply chain, the smaller the risk.

CoPaw takes the opposite approach. Open-sourced by Alibaba Cloud's AgentScope team, it goes the feature-complete route. It has a built-in active heartbeat mechanism: rather than only passively responding to user messages, it proactively triggers tasks at set times. Memory is stored locally, with user preferences and historical tasks continuously accumulating. It supports DingTalk, Feishu, Discord, iMessage, and other channels, with a continuously expanding Skills ecosystem. If nanoclaw and nanobot are doing subtraction, CoPaw is seriously answering the question of what a complete personal AI assistant should look like.

Early this year, another open-source framework named Aden Hive appeared, answering a deeper question: Can the orchestration framework itself self-evolve?

The fundamental difference from traditional frameworks like LangChain and AutoGPT isn't in functionality, but in that it doesn't require developers to predefine agent execution flows. Its approach: describe goals in natural language, have a Coding Agent (Queen Bee) generate the Agent execution graph and connection code; once running, if failures occur, the framework captures failure data and calls the Coding Agent again to analyze causes, modify structure, and redeploy. This closed loop requires no human intervention. This is a serious bet on generative orchestration. It bets that task complexity often can't be predefined—rather than exhaustively enumerating all cases at design time, let the system continuously grow from feedback during real execution.
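The self-repair loop described above—generate an execution graph, run it, feed failure data back to the coding agent, redeploy—can be sketched as a simple closed loop. The interfaces below are hypothetical, not Aden Hive's actual API:

```python
def self_evolving_run(goal, generate_graph, execute, analyze_and_fix,
                      max_rounds=5):
    """Closed loop: build an agent graph from a natural-language goal,
    run it, and on failure let a coding agent rewrite the graph.
    No human intervention inside the loop."""
    graph = generate_graph(goal)          # "Queen Bee" generates the graph
    for _ in range(max_rounds):
        ok, failure_data = execute(graph)
        if ok:
            return graph
        graph = analyze_and_fix(graph, failure_data)  # regenerate & redeploy
    raise RuntimeError("goal not reached within budget")

# Toy stand-ins: a "graph" is just a version number; version 2 succeeds.
result = self_evolving_run(
    goal="summarize inbox",
    generate_graph=lambda g: {"goal": g, "version": 1},
    execute=lambda gr: (gr["version"] >= 2, "timeout in step 3"),
    analyze_and_fix=lambda gr, err: {**gr, "version": gr["version"] + 1},
)
```

The bet on generative orchestration lives entirely in `analyze_and_fix`: instead of a human debugging a predefined flow, a coding agent consumes the failure data and emits a new structure.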

Whether Agents as personal assistants or Agent orchestration frameworks themselves, self-evolution is transitioning from a bonus feature to a design starting point.


Model "Big Three" Each Build Complete Ecosystem Tools

The top model companies are each laying out open-source ecosystem tools and standards. MCP, Skills, Agents.md—they land one after another, and third-party tools can barely keep up with digesting them.

An interesting phenomenon is the blurring boundary between Coding Agent and General Agent. After ChatGPT appeared, people searched for a long time before finding viable landing scenarios beyond Chatbot—Coding was among the first to be validated. But when tools like Claude Code reach a certain level, they naturally expand outward, not wanting to just be code-writing tools. OpenClaw was born under this expectation—using the IM window people are most familiar with as a carrier, attempting to carry more general Agent capabilities.


Project Story: One-Person Company? Zero-Person Company!

Just as the OPC (One Person Company) concept was being hotly discussed, a project called Paperclip that appeared in early March has pushed this further. The concept it's hyping: Zero-Person Company. In just over 20 days, Stars grew from 0 to 40,000.

Paperclip's positioning is very direct:

"If OpenClaw is an employee, Paperclip is the company."

Its usage logic has three steps: set goals, recruit a team, press start.

The goal could be "grow this AI note-taking app to $1M monthly revenue"; the team could be Claude as CEO, Cursor as CTO, Codex for engineering, OpenClaw for marketing; once started, this company begins running itself.

Even more interesting is its governance design. Agents can't hire new Agents themselves—this needs your approval; CEO can't unilaterally execute strategies—needs your confirmation. Paperclip positions you as the board—you can pause, override, reassign, or terminate any Agent at any time. Autonomy is a privilege you grant, not an Agent's default power.
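Gated autonomy of this kind boils down to routing certain action types through a human approval check before they execute. A toy sketch of the idea (names hypothetical, not Paperclip's actual design):

```python
# Action types that require explicit board approval (illustrative set):
GATED_ACTIONS = {"hire_agent", "execute_strategy"}

def dispatch(action, approved_by_board=False):
    """Agents act freely on routine actions; gated actions are held
    until the human 'board' grants approval."""
    if action in GATED_ACTIONS and not approved_by_board:
        return "pending_approval"
    return "executed"

routine = dispatch("send_report")                      # runs immediately
gated = dispatch("hire_agent")                         # held for the board
granted = dispatch("hire_agent", approved_by_board=True)
```

The design point is that autonomy defaults to off for structural decisions: the allowlist of gated actions, not the agent, decides where the human sits in the loop.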

In the OPC era, one person can do many things. But the question Paperclip is asking: if even that "one person's" execution work can be outsourced to Agents, what role remains for you? Probably just one word: Board.


The AI Era's "Developers and AI"

Having covered projects, let's look at the other side: the people behind these projects.

Developers: Concentrated in Head Projects, But from Diverse Backgrounds

In February 2026, across the top 50+ Agentic projects, there were approximately 21,000 independently active developers. But the “21,000” figure is somewhat misleading, because they are not evenly distributed across these projects: active developers in OpenClaw and Claude Code alone account for nearly half of the total.

Activity distribution is similarly highly concentrated. This is the familiar power law phenomenon in open-source communities, but it's particularly extreme in this ecosystem: top developer activity scores reach 81, while 95% of developers have activity under 1—a minority driving most substantive progress.

There are several noteworthy numbers in these developers' background composition. Among the 4,232 developers who filled in company information, those from big companies like FAANG and BAT account for less than 10%. More are independent developers and startup people—this ecosystem is not currently dominated by big company engineers.

Geographically, among the 6,295 developers who filled in country information, US developers account for 30%, and Chinese developers account for 10%.


Developers: Young and Cross-Disciplinary, "Builders," "Founders," and "Digital Nomads"

We focused on the top 100 most active developers. They're significantly younger, or at least arrived at the developer community later—the median account creation time is January 2018. If you include long-tail developers, the median becomes December 2013. These two numbers together tell us one thing: a significant portion of top active contributors entered the developer community after the Kubernetes era, and their technical intuition and infrastructure cognition differ noticeably from cloud-native veterans.

Even more extreme: among the 100, one-quarter (25 developers) registered GitHub after 2023, meaning they started coding only after LLMs truly went mainstream. ComfyUI author comfyanonymous and Aden Hive author RichardTang-Aden are among them. They're not developers "changed" by the AI wave—they're developers "summoned" by it. Before this, they might not have considered themselves developers at all.

Here are several representative developers. In their self-descriptions, they are designers, musicians, self-taught developers, prompt engineers, hackers, and digital nomads. Their commonality isn't technical background—it's that verb: build.


Developers and AI: Replacement or Symbiosis? Let's Look at the Numbers

This question is hard to answer directly, but numbers can provide clues. Searching GitHub for Claude-attributed Commits yields over 20 million results. Using the same search method: Cursor about 1 million, Copilot 700K, Gemini 450K, Codex even lower. The difference between Claude and others is a full order of magnitude.

Of course, this data has obvious limitations—it's a fuzzy search, many AI-assisted commits are never attributed at all, and attribution habits vary by tool and team culture. But even after heavy discounting, the order-of-magnitude gap tells us one thing: Claude-series tools are embedded quite deeply in actual code submission pipelines.
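One common attribution signal is the `Co-authored-by:` trailer that some coding agents append to commit messages. Counting such trailers over a batch of commit messages (e.g. the output of `git log --format=%B`) might look like the rough sketch below; the search in the text above is a fuzzy full-text query, so this is an approximation of the method, not a reproduction of it:

```python
from collections import Counter
import re

def count_ai_coauthors(commit_messages):
    """Tally Co-authored-by trailer names across commit messages."""
    pattern = re.compile(r"^Co-authored-by:\s*([^<\n]+)", re.I | re.M)
    counts = Counter()
    for msg in commit_messages:
        for name in pattern.findall(msg):
            counts[name.strip()] += 1
    return counts

msgs = [
    "fix: retry logic\n\nCo-authored-by: Claude <noreply@anthropic.com>",
    "feat: new parser\n\nCo-authored-by: Claude <noreply@anthropic.com>",
    "docs: update readme",  # no attribution at all
]
tallies = count_ai_coauthors(msgs)
```

Note the third message: unattributed AI involvement is invisible to this method, which is exactly the undercounting caveat discussed above.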

Beyond code generation, Review is another link being taken over by Agents. Copilot and CodeRabbit have completed hundreds of thousands of code Reviews in less than three months this year. The significance of this number isn't just scale, but that Review was previously considered highly dependent on human judgment—it requires understanding context, intent, and team norms. How well Agents can do this is still hard to determine, but they're already doing it.

Among all Agent landing scenarios, Coding is one of the few that has truly completed commercial validation. Other scenarios are still telling stories; Coding Agents are already collecting money.


2026 Coding Agent Landscape: Prompting, Generation, Review, to Requirements Management

We've compiled a landscape of currently popular Coding Agents. The code-completion stage is basically a thing of the past, but Copilot is still holding on: while it can't match Claude at writing code, as GitHub's native AI collaboration tool it still leads in code review.

Due to time constraints, we didn't do deeper research this time. There's an interesting question: do PRs using Review Agents get merged significantly faster than those without? Intuitively yes, but "significantly" to what extent, and in what types of projects is it most obvious—this deserves serious data analysis.

The more interesting part of the landscape is that some projects are already exploring earlier stages of the software development lifecycle—requirements management. Besides the aforementioned Vibe Kanban, Dane in the Mastra project is another fascinating bot. It can connect to various community channels—Slack, Discord, or mailing lists—extract or abstract project requirements from discussions, and directly file Issues in repositories.


Finally: Amidst AI FOMO, Openness, Sharing, and Collaboration Remain Developers' Spiritual Home

👆 The heading above is a personal sentiment written to close this piece.

Peter Steinberger is a tireless open-source builder and creator in the AI era. Before OpenClaw, he had already open-sourced 50+ projects. OpenClaw rekindled everyone's enthusiasm in this exhausted era, largely because it's an open-source project—not just spiritually, but because open-source means it can run locally, means data has some degree of privacy, means you can optimize or fork the project.

Under the AI FOMO wave, models iterate, products iterate, funding iterates. But openness, sharing, and collaboration have never truly gone out of style in the developer community. This is perhaps one of the few things in this ecosystem that doesn't need to wait for "the next version."

The Community Stories of vLLM and SGLang

· 7 min read
inclusionAI
Ant Group

Originally published on Medium by Ant Open Source.

First, what is LLM inference?

Training large language models (LLMs) attracts attention for its massive compute demands and headline-making breakthroughs; however, what ultimately determines their real-world practicality and broad adoption is the efficiency, cost, and latency of the inference stage. Inference is the process by which a trained AI model applies what it has learned to new, unseen data to make predictions or generate outputs. For LLMs, this means accepting a user prompt, computing through the model's vast network of parameters, and ultimately producing a coherent text response.
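Concretely, that process is an autoregressive loop: each step feeds the tokens generated so far back into the model to pick the next token, until an end-of-sequence marker or a budget is reached. A toy sketch with a canned stand-in "model" (not a real LLM):

```python
def toy_model(tokens):
    """Stand-in for a real LLM forward pass: deterministically
    returns the next token of a tiny canned continuation."""
    continuation = ["Hello", ",", "world", "!", "<eos>"]
    return continuation[min(len(tokens) - 1, len(continuation) - 1)]

def generate(prompt_tokens, model, max_new_tokens=16, eos="<eos>"):
    """Greedy autoregressive decoding: append one token per step
    until EOS or the new-token budget is hit."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        nxt = model(tokens)
        if nxt == eos:
            break
        tokens.append(nxt)
    return tokens

out = generate(["<bos>"], toy_model)  # → ["<bos>", "Hello", ",", "world", "!"]
```

Every optimization discussed below—batching, KV caching, scheduling—exists to make this one-token-at-a-time loop cheap at scale, because each step of a real model is a pass through billions of parameters.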

The core challenge in LLM inference is deploying models with tens to hundreds of billions of parameters under tight constraints on latency, throughput, and cost. It is a complex, cross-stack problem spanning algorithms, software, and hardware. Among open-source inference engines, vLLM and SGLang are two of the most closely watched projects.

From academic innovation to a community-driven open-source standard-bearer

vLLM and SGLang history

vLLM traces its roots to a 2023 paper centered on the PagedAttention algorithm, "Efficient Memory Management for Large Language Model Serving with PagedAttention." vLLM's breakthrough wasn't a brand-new AI algorithm; instead, it borrowed paging and cache-management ideas from operating systems to achieve fine-grained memory management, laying the groundwork for high-throughput request handling via its PagedAttention mechanism. vLLM also embraced and advanced several industry techniques, such as Continuous Batching, first described in the paper "Orca: A Distributed Serving System for Transformer-Based Generative Models."
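The OS-paging analogy can be made concrete: the KV cache is split into fixed-size blocks, and each sequence keeps a block table mapping its logical token positions to physical blocks, so memory is allocated on demand instead of reserved contiguously up front. A heavily simplified sketch of the allocator idea (illustrative only, not vLLM's implementation):

```python
class PagedKVCache:
    """Toy paged KV-cache allocator: fixed-size blocks handed out on
    demand, with a per-sequence block table (the OS-paging analogy)."""
    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}   # seq_id -> list of physical block ids
        self.seq_lens = {}       # seq_id -> tokens stored so far

    def append_token(self, seq_id):
        """Reserve space for one more token of `seq_id`."""
        table = self.block_tables.setdefault(seq_id, [])
        n = self.seq_lens.get(seq_id, 0)
        if n % self.block_size == 0:      # current block full: map a new one
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = n + 1

    def free(self, seq_id):
        """Return all of a finished sequence's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)

cache = PagedKVCache(num_blocks=8, block_size=4)
for _ in range(6):                        # 6 tokens fit in 2 blocks of 4
    cache.append_token("req-1")
```

The payoff is that internal fragmentation is bounded by one block per sequence, and freed blocks are immediately reusable by other requests, which is what enables the high-throughput batching described above.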

vLLM star growth

(Source: star-history)

vLLM delivered striking gains: compared with a Hugging Face Transformers–based backend, vLLM handled up to 5× the traffic and boosted throughput by as much as 30×. Within less than half a year it amassed tens of thousands of stars; today, over ten thousand contributors have engaged in issue/PR discussions, nearly 2,000 have submitted PRs, and on average at least 10 new issues are filed daily.

SGLang originated from the paper "SGLang: Efficient Execution of Structured Language Model Programs" and opened new ground with a highly optimized backend runtime centered on RadixAttention and an efficient CPU scheduling design. Rather than discarding PagedAttention, RadixAttention extends it: it preserves as much prompt and generation KV cache as possible and attempts to reuse KV cache across requests; when prefixes match, it slashes prefill computation.
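The prefix-reuse idea can be illustrated with nothing more than longest-common-prefix matching over cached sequences: tokens covered by a match already have KV entries and can skip prefill. SGLang actually organizes this with a radix tree; the toy linear scan below only demonstrates the principle:

```python
def longest_cached_prefix(new_tokens, cached_sequences):
    """Return how many leading tokens of `new_tokens` are already
    covered by some cached sequence's KV entries."""
    best = 0
    for cached in cached_sequences:
        n = 0
        for a, b in zip(new_tokens, cached):
            if a != b:
                break
            n += 1
        best = max(best, n)
    return best

# Two requests sharing a long system prompt but ending differently:
cache = [["sys", "you", "are", "helpful", "Q1"]]
new_req = ["sys", "you", "are", "helpful", "Q2"]
reused = longest_cached_prefix(new_req, cache)
to_prefill = len(new_req) - reused   # only the divergent suffix is computed
```

With long shared system prompts or few-shot templates, the reused prefix dominates the request, which is where the prefill savings come from.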

Community metrics for vLLM and SGLang

(Current community metrics for both projects, data as of August 22, 2025)

Community-wise, SGLang is a fast-rising newcomer with a leaner footprint — its total contributor count is less than half of vLLM's. Most issues in vLLM receive responses within 12 hours to 3 days, whereas in SGLang it typically takes 3 to 5 days.

Origins: a continuous current of innovation

As a leading U.S. public research university, UC Berkeley has produced a remarkable roster of open-source projects: Postgres in databases, RISC-V in hardware, Spark in big-data processing, and Ray in machine learning. Early core initiators of the two projects — Woosuk Kwon (vLLM) and Lianmin Zheng (SGLang) — both hail from Berkeley and studied under Ion Stoica, the luminary who led students to create Spark and Ray.

vLLM led the way with an open-source release in June 2023; SGLang debuted roughly six months later. In 2023, Lianmin, Stanford's Ying Sheng, and several scholars founded the open research group LMSYS.org, launching popular projects such as FastChat, Chatbot Arena, and Vicuna.

Today, core initiators Woosuk and Lianmin remain actively involved. Recent six-month contributor data show that early-career academic researchers remain a major force. Beyond academia, vLLM's contribution backbone includes Red Hat, while SGLang's core contributors come from xAI, Skywork, Oracle, and LinkedIn.

Top contributors by organization

As many as 194 developers have contributed code to both vLLM and SGLang — about 30% of SGLang's total code contributors to date. Notable cross-contributors include:

  • comaniac (OpenAI): 17 early PRs to SGLang + 77 PRs total to vLLM
  • ShangmingCai (Alibaba Cloud Feitian Lab): 18 PRs to vLLM, then shifted focus to SGLang with 52 PRs
  • CatherineSue (Oracle): 4 bug-fix PRs to vLLM, then 76 PRs to SGLang as a core contributor

Development, refactors, and fierce competition

Key milestones from an OpenRank perspective:

  • June 2023: vLLM officially launches, introduces PagedAttention, and grows quickly.
  • January 2024: SGLang ships its first release and gains industry attention thanks to RadixAttention.
  • July 2024: SGLang releases v0.2, entering its first acceleration phase.
  • September 2024: vLLM ships v0.6.0, cutting latency ~5× and improving performance ~2.7× via CPU-scheduling and other optimizations; SGLang releases v0.3 the day before.
  • December 2024–January 2025: vLLM unveils the V1 refactor. With DeepSeek V3/R1 bursting onto the scene, both begin a second wave of explosive growth.

OpenRank comparison

In 2024, as features and hardware support expanded rapidly, vLLM hit classic software-engineering headwinds. A third-party performance study published in September showed that in some scenarios vLLM's CPU scheduling overhead could exceed half of total inference time. The official blog acknowledged the need for a foundational refactor: V1 arrived in early 2025, after which growth re-accelerated.

CPU scheduling overhead comparison

CPU scheduling overhead: vLLM (left) vs. SGLang (right)

In 2025, the performance race among inference engines heated up. Recognizing the limits of "number wars," both gradually shifted to reproducible methods and end-to-end metrics. A recent third-party comparison from Alibaba Cloud benchmarking vLLM vs. SGLang on the Qwen family showed overall single-GPU/dual-GPU results favoring SGLang, though outcomes vary across hardware/models/configurations.

Trend-wise, inference-engine architectures are showing signs of convergence. Leaders vLLM and SGLang now both support Continuous Batching, PagedAttention, RadixAttention, Chunked Prefill, Speculative Decoding, Disaggregated Serving, and CUDA Graphs, and both integrate operator libraries such as FlashInfer, FlashAttention, and DeepGEMM.

Other inference engines to watch:

  • TensorRT-LLM: launched by NVIDIA in late 2023, deeply tuned for its own hardware
  • OpenVINO: developed by Intel, focused on efficient deployment across Intel CPUs/GPUs
  • Llama.cpp: written in C++ by Georgi Gerganov in 2023, targets low-barrier edge inference
  • LMDeploy: co-developed by the MMDeploy and MMRazor teams, with dual backends — TurboMind for high performance and PyTorch for broad hardware coverage

Moving Forward in the Ecosystem

During their rapid-growth phase, both vLLM and SGLang drew attention from investors and open-source foundations:

  • In August 2023, a16z launched the Open Source AI Grant, funding vLLM core developers Woosuk Kwon and Zhuohan Li. In a later cohort, SGLang core developers Ying Sheng and Lianmin Zheng were also funded.
  • In July 2024, ZhenFund donated to vLLM, and LF AI & Data announced vLLM's entry into incubation; this year vLLM was moved under the PyTorch Foundation.
  • In March 2025, PyTorch published a blog post welcoming SGLang to "the PyTorch ecosystem."

Both projects have become go-to inference solutions globally, with active participation from engineers at Google, Meta, Microsoft, ByteDance, Alibaba, Tencent, and other companies.

Developer company backgrounds (issues)

Company backgrounds of developers submitting Issues (data source: ossinsight)

Today, roughly 33% of vLLM's contributors are based in China, and about 52% for SGLang. Both communities host regular in-person meetups in Beijing, Shanghai, Shenzhen, and other cities worldwide.

Open Source LLM Development Landscape 2.0: 2025 Revisited

· 9 min read
inclusionAI
Ant Group

Originally published on Medium by Ant Open Source.

Just over three months ago, Ant Open Source and InclusionAI jointly released the very first Open Source LLM Development Landscape, along with a trend insights report. Our goal was simple: to highlight which projects in this fast-moving ecosystem are most worth tracking, using, and contributing to.

That's why we're excited to unveil the 2.0 release of our Landscape — a refreshed view of the ecosystem, built with even more insights and context. With the 2.0 release, we also refreshed our methodology for mapping the ecosystem, surfacing a wave of previously overlooked projects while removing others that didn't make the cut.

Open Source LLM Development Landscape 2.0

Open Source LLM Development Landscape 2.0: https://antoss-landscape.my.canva.site/

The updated landscape is organized into two major directions: AI Infra and AI Agents. Drawing on community data, we identified and included 114 of the most prominent open source projects, spanning 22 distinct technical domains.

For 2.0, we shifted to using the global GitHub OpenRank rankings directly. From the top down, we filtered projects by their descriptions and tags to identify those belonging to the LLM ecosystem, and gradually refined the scope. Only projects with an OpenRank score of 50 or higher are included.

Note: By installing the HyperCRX browser extension, you can view an open-source project's OpenRank trend in the bottom-right corner of its GitHub repository page.

Compared with the 1.0 release, this new 2.0 Landscape brings in 39 fresh projects — about 35% of the total list. On the other hand, 60 projects from the first version have been dropped, mostly because they fell below the new bar. Even if we include the dropped projects, the median "age" of all projects is just 30 months — barely two and a half years old. 62% of these projects were launched after the "GPT moment" (October 2022).

These projects have drawn participation from 366,521 developers worldwide. Among those with identifiable locations, about 24% are based in the United States, 18% in China, followed by India (8%), Germany (6%), and the United Kingdom (5%).

Global Developer Contribution

Across the 170+ open source projects covered in both Landscape versions, we observed over 360K GitHub accounts engaging through issues or pull requests. Among these, we identified 124,351 developers with parseable location data.

Overall, the U.S. accounts for 37.4% of contributions and China for 18.7%, putting their combined share above 55%. Germany drops sharply to 6.5% in third place.

Top 10 Countries by Contribution

Top 10 Countries by Contribution in Open-Source LLM Ecosystem

Looking across technical fields:

  • In AI Infra, U.S. and China account for over 60% of contributions
  • In AI Data, participation is more globally distributed, with several European countries ranking in the global top 10
  • In AI Agents, U.S. and Chinese developers contribute 24.6% and 21.5% respectively

Large Models Landscape 2025

Outside of the open source development ecosystem, large models themselves are being released at a rapid pace. A few interesting observations:

  • MoE Takes Center Stage: Flagship models like DeepSeek, Qwen, and Kimi have all adopted Mixture of Experts (MoE) architecture — sparse activation enabling trillion-parameter giants like K2, Claude Opus, and o3.
  • Reinforcement Learning Boosts Reasoning: DeepSeek R1 combines large-scale pretraining with RL-based post-training, making reasoning the signature feature for flagship model releases in 2025. Series like Qwen, Claude, and Gemini have begun integrating "hybrid reasoning" modes.
  • Multimodality Goes Mainstream: Most 2025 releases focus on language, image, and speech interaction, though specialized vision-only and speech-only models have also emerged.

Large Models Development Keywords

We extracted keywords from the GitHub descriptions and topics of every open-source project in the Landscape. The most frequent keywords are: AI (126), LLM (98), Agent (81), Data (79), Learning (44), Search (36), Model (36), OpenAI (35), Framework (32), Python (30), MCP (29).
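Keyword counts like these can be reproduced with a simple tokenizer plus a counter over project descriptions and topic lists. The sketch below illustrates the method in rough form; it is not our exact pipeline (case folding and stop-word handling, for instance, would need tuning):

```python
from collections import Counter
import re

def keyword_counts(projects):
    """Count case-insensitive word frequencies across GitHub
    project descriptions and topic lists."""
    counts = Counter()
    for p in projects:
        text = p["description"] + " " + " ".join(p["topics"])
        for word in re.findall(r"[A-Za-z][A-Za-z0-9]*", text):
            counts[word.lower()] += 1
    return counts

sample = [
    {"description": "LLM agent framework", "topics": ["agent", "llm"]},
    {"description": "Data pipeline for LLM training", "topics": ["data"]},
]
top = keyword_counts(sample)
```

On real data, `projects` would come from the GitHub API's repository metadata, and `top.most_common(10)` yields a list like the one above.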

Keyword cloud

Top 10 Open Source Projects

The top 10 projects by OpenRank span nearly the entire chain: from foundational compute and frameworks like PyTorch and Ray, to training data pipelines such as Airflow, to high-performance serving engines like vLLM, SGLang, and TensorRT-LLM. On the application side: Dify, n8n, Gemini CLI, and Cherry Studio.

Top 10 by OpenRank

Note: All data is as of August 1, 2025

Looking at the forces behind these projects:

  • Academia: Projects like vLLM, SGLang, and Ray emerged from UC Berkeley's labs under Ion Stoica
  • Tech giants: Meta, Google, NVIDIA hold or shape critical positions in the stack
  • Indie teams: Smaller teams like Dify and Cherry Studio are innovating rapidly near the application layer

Redefining Open Source in the LLM Era

Veterans familiar with open-source licensing might feel alarmed looking at the licenses adopted by today's top projects. While most projects still rely on permissive licenses like Apache 2.0 or MIT, several high-profile cases stand out:

  • Dify's "Open Source License": Based on Apache 2.0 but restricts unauthorized multi-tenant operation and prohibits removing logos/copyright notices.
  • n8n's "Sustainable Use License": Allows free use and modification but restricts commercial redistribution.
  • Cherry Studio's "User-Segmented Dual Licensing": AGPLv3 for ≤10-person orgs; commercial license required for larger orgs.

At the same time, GitHub has evolved into a stage for product operations. Many products with closed-source codebases — like Cursor and Claude Code — still maintain GitHub presences primarily for collecting user feedback, often accumulating huge numbers of stars despite providing little or no actual source code.

AI Coding, Model Serving, and LLMOps are all on an upward trajectory. AI Coding stands out with a steep growth curve — once again confirming that boosting R&D efficiency with AI is the application scenario truly taking root in 2025.

On the other hand, Agent Frameworks and AI Data have shown noticeable declines. The drop in Agent Frameworks is closely tied to reduced community investment from once-dominant projects like LangChain, LlamaIndex, and AutoGen.

Projects on The Brink List

Some projects didn't make it into this version but still show strong potential. Among the projects that dropped out, many appear to be heading toward the "AI graveyard":

  • Manus briefly exploded in popularity, inspiring open-source forks like OpenManus and OWL, but the hype proved short-lived.
  • NextChat, one of the earliest popular LLM client apps, lost ground to newer entrants like Cherry Studio and LobeChat.
  • Bolt.new, once a trendy full-stack web dev tool, was open-sourced as template repos with little external contribution.
  • MLC-LLM and GPT4All were once widely used for on-device deployment, but Ollama emerged as the clear winner in this niche.
  • FastChat evolved into the more successful SGLang and LMArena platforms.
  • Text Generation Inference (TGI) was gradually abandoned by Hugging Face as performance fell behind vLLM and SGLang.

100 Days of Change and Continuity

Beyond project reshuffling, the jump from 1.0 to 2.0 brought refinements to how we define and describe the ecosystem. The broad categories of "Infrastructure" and "Application" were restructured into three clearer domains: AI Infra, AI Agent, and AI Data.

New Fields and Projects Entering the Spotlight

The most notable shifts are happening in the Agent layer, with high-profile projects emerging across AI Coding, chatbots, and development frameworks. Two projects stand out for their connection to embodied intelligence: AI XiaoZhi (ESP32-based AI voice interaction device) and Genesis (robotics and embodied simulation platform).

On the Infra side, the biggest change is the integration of "model operations" into a more holistic concept: LLMOps — spanning Observability (Langfuse, Phoenix), Evaluation & Benchmarking (Promptfoo), and Agentic Workflow Runtime Management (1Panel, Dagger).

Top 10 Active Newcomers: Notably, Gemini CLI ranked 3rd and Cherry Studio ranked 7th across all projects in the Landscape — a remarkable showing for first-time entrants.

Top 10 new projects

Note: All data is as of August 1, 2025

What Hasn't Changed: Rise, Fall, and the Cycle of Momentum

Among the new wave, OpenCode was positioned from day one as a 100% open-source alternative to Claude Code. Other newcomers highlight how major players are laying out strategies across model serving, agent toolchains, and AI coding:

  • Dynamo supports vLLM, SGLang, and TensorRT-LLM while being optimized for NVIDIA GPUs
  • adk-python and openai-agents-python are agent builders packaged for Gemini and OpenAI models
  • Gemini CLI and Codex CLI bring autonomous code understanding directly into the command line

The projects showing the most noticeable growth include TensorRT-LLM, verl (RL framework from ByteDance), OpenCode, and Mastra (TypeScript/JavaScript Agent framework). In contrast, the sharpest declines include Eliza, LangChain, LlamaIndex, and AutoGen.

Serving: Making Models Truly Usable

Model serving is about running a trained model in a way that applications can reliably call — not just "can it run?" but "can it run efficiently, controllably, and at scale?" Since 2023, rapid progress has made serving the critical middleware layer connecting AI infrastructure with applications.

Coding: The New Developer Vibe

AI Coding has evolved far beyond basic code completion, now encompassing multimodal support, contextual awareness, and collaborative workflows. CLI tools like Gemini CLI and OpenCode leverage large models to transform developer intent into faster coding. Plugin-based tools such as Cline and Continue integrate into existing development platforms.

Agent: Building Toward AGI

2025 is widely considered the year AI applications truly land. The open-source ecosystem has expanded with projects specializing in different components: Mem0 (memory), Browser-Use (tool use), Dify (workflow execution), and LobeChat (interaction interface) — together shaping a more complete foundation for building autonomous AI systems.

More on GitHub: https://github.com/antgroup/llm-oss-landscape

ABench: An Evolving Open-Source Benchmark

· 2 min read
inclusionAI
Ant Group

🌟 Overview

ABench is an evolving open-source benchmark suite designed to rigorously evaluate and enhance Large Language Models (LLMs) on complex cross-domain tasks. By targeting current model weaknesses, ABench provides systematic challenges in high-difficulty specialized domains, including physics, actuarial science, logical reasoning, law, and psychology.

🎯 Core Objectives

  1. Address Evaluation Gaps: Design high-differentiation assessment tasks targeting underperforming question types
  2. Establish Unified Standards: Create reliable, comparable benchmarks for multi-domain LLM evaluation
  3. Expand Capability Boundaries: Drive continuous optimization of knowledge systems and reasoning mechanisms through challenging innovative problems

📊 Dataset Release Status

| Domain | Description | Status |
| --- | --- | --- |
| Physics | 500 university/competition-level physics problems (400 static + 100 dynamic parametric variants) covering 10+ fields from classical mechanics to modern physics | ✅ Released |
| Actuary | Curated actuarial exam problems covering core topics: probability statistics, financial mathematics, life/non-life insurance, actuarial models, and risk management | ✅ Released |
| Logic | High-differentiation logical reasoning problems from authoritative tests (LSAT/GMAT/GRE/SBI/Chinese Civil Service Exam) | 🔄 In Preparation |
| Psychology | Psychological case studies and research questions (objective/subjective) evaluating understanding of human behavior and theories | 🔄 In Preparation |
| Law | Authoritative judicial exam materials covering core legal domains: criminal/civil/administrative/procedural/international law | 🔄 In Preparation |
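To make the "dynamic parametric variant" idea concrete, here is a purely hypothetical sketch: a physics template re-instantiated with fresh numbers so the answer cannot be memorized. The template, parameter ranges, and answer formula below are invented for illustration and do not come from ABench itself.

```python
import random

def projectile_variant(rng):
    # Sample a fresh parameter for each variant (hypothetical template).
    v0 = rng.randint(10, 30)  # initial speed in m/s
    g = 9.8
    question = (f"A ball is thrown straight up at {v0} m/s. "
                f"Ignoring air resistance, what is its maximum height?")
    # Closed-form answer for this template: h = v0^2 / (2g)
    answer = round(v0 ** 2 / (2 * g), 2)
    return question, answer

q, a = projectile_variant(random.Random(0))
```

Because the reference answer is computed from the sampled parameters, each variant can be graded automatically without a fixed answer key.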

Open Source LLM Development 2025: Landscape, Trends and Insights

· 10 min read
inclusionAI
Ant Group

Originally published on Medium by Ant Open Source.

AI Surpasses Cloud Native as the Most Influential Tech Domain

According to OpenRank data from OpenDigger, AI surpassed Cloud Native in 2023 to become the most influential technology domain in terms of community collaboration on GitHub. AI's total influence score overtook Frontend technologies in 2017, accelerated post-2022, and surpassed the declining Cloud Native in 2023 to claim the top spot.

AI surpasses Cloud Native

The LLM Development Ecosystem: A Snapshot

LLM Development Landscape

https://antoss-landscape.my.canva.site

In February 2025, DeepSeek sparked a surge in the LLM development ecosystem. GitHub's Weekly Trending List reached a peak where 94% of the listed repositories were AI-related. This ecosystem is remarkably young and fast-moving: over the past three months, 60% of the LLM-related projects that appeared on GitHub Trending emerged after 2024, and nearly 21% were created in just the last six months.

We built the landscape by first selecting well-known AI projects (e.g., PyTorch, LangChain, vLLM) as seed nodes, then analyzing developer collaboration relationships across related GitHub projects to explore multiple facets of the ecosystem. We rely on the OpenRank influence metric developed by X-lab at East China Normal University; only projects with an average monthly OpenRank score exceeding 10 in 2025 are included.
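The inclusion rule reduces to a simple threshold filter over monthly scores. A minimal sketch (the project names and numbers below are illustrative, not real OpenRank data):

```python
# Illustrative monthly OpenRank scores per project (made-up values).
monthly_openrank = {
    "pytorch":   [55, 58, 60, 57, 59],
    "langchain": [30, 28, 25, 24, 22],
    "tiny-demo": [2, 3, 4, 3, 2],
}

def included(scores, threshold=10):
    # Keep a project only if its average monthly score exceeds the threshold.
    return sum(scores) / len(scores) > threshold

landscape = [p for p, s in monthly_openrank.items() if included(s)]
```

With these toy numbers, `tiny-demo` (average 2.8) falls below the cutoff while the other two qualify.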

As of May 2025, the Open Source LLM Development Landscape 2025 includes 135 projects across 19 technical domains, spanning both Agent application layers and model infrastructure layers.

Below are the details of projects ranked in the Top 20 of OpenRank:

Top 20 by OpenRank

By stack ranking the year-over-year absolute changes in OpenRank between 2024 and 2025, we converged on 3 key observations:

  • Model Training Frameworks: PyTorch remains the undisputed leader. Baidu's PaddlePaddle saw a 41% drop in OpenRank compared to the previous year.
  • Efficient Inference Engines: The high-performance inference engines vLLM and SGLang have undergone rapid iterations, ranking first and third in OpenRank value growth. Their superior GPU inference performance made them the most popular choices for enterprise-level LLM deployment.
  • Low-Code Application (Agent) Development Frameworks: Agent platforms like Dify and RAGFlow, which integrate RAG-based knowledge retrieval, are experiencing rapid growth as they meet the red-hot demand for quickly building AI applications. Notably, both platforms are strong projects emerging from China's developer community.
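The stack-ranking step described above amounts to sorting projects by the absolute year-over-year OpenRank delta. A minimal sketch with made-up numbers (not the report's actual data):

```python
# Hypothetical (2024 avg, 2025 avg) OpenRank pairs per project.
openrank = {
    "vllm":   (80.0, 140.0),
    "sglang": (30.0,  85.0),
    "paddle": (70.0,  41.3),
    "dify":   (60.0, 110.0),
}

# Year-over-year change, then rank by absolute magnitude of the change.
deltas = {p: y2025 - y2024 for p, (y2024, y2025) in openrank.items()}
ranked = sorted(deltas, key=lambda p: abs(deltas[p]), reverse=True)
```

Sorting by absolute change surfaces both fast risers and sharp decliners in one list, which is how gainers like vLLM and decliners like PaddlePaddle appear in the same comparison.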

After observing over 100 open-source projects, we are ready to make a bold claim: the LLM development ecosystem operates like a real-world hackathon. Empowered by AI, developers now operate as "super individuals," rapidly building open-source projects around trending topics in cycles of fast creation and dissolution driven by speed and iteration.

Key hackathon observations:

1. Developers keep building OSS clones for rapid adoption

When closed-source projects like Devin, Perplexity, and Manus brought shockwaves to the industry, developers quickly replicated open-source versions:

  • Devin & OpenDevin: In March 2024, Xingyao Wang (PhD candidate at UIUC) launched OpenDevin. Within a month, its OpenRank skyrocketed to 190. The project was rebranded as OpenHands and evolved into All Hands AI.
  • Perplexity & Perplexica: Independent developer ItzCrazyKns created Perplexica in 2024 as an open-source alternative. It amassed 22K GitHub stars but OpenRank plateaued around 25.
  • Manus & OpenManus: In March 2025, as Manus went viral, DeepWisdom pulled off a "3-hour replication" with OpenManus, garnering 8K stars on its first day.

2. Ephemeral technical experiments often end up in the AI graveyard

Out of 5,079 AI tools recorded by Dang AI, 1,232 have been archived/abandoned. Dang AI even created an "AI Graveyard." We've curated an "Open-Source AI Graveyard" for projects that gained massive attention upon launch but became inactive — including BabyAGI (April 2023) and Swarm (OpenAI, formally discontinued March 2025).

3. Model capabilities are reshaping application scenarios

  • The decline of AI Search projects: The generalization of model capabilities (GPT-4, Gemini 2.0) is squeezing the market for specialized search tools like Morphic.sh and Scira.
  • The rise of AI Coding projects: Claude 3.7 Sonnet's prowess in coding ushered in "Vibe Coding." IDE plugins like Continue and Cline are thriving open-source options, each with over 3,000 community contributors and steadily rising OpenRank scores.

4. Dynamic competition across ecosystem niches

  • Divergent trajectories of Agent Frameworks: Application platforms like Dify diverged sharply from development frameworks like LangChain. Special mention: DB-GPT, an open-source project initiated by Ant Group, brings AI application development into big-data scenarios.
  • The rise of Reinforcement Learning: DeepSeek-R1's "Aha Moment" demonstrated RL's effectiveness as a post-training approach. Frameworks like Verl and OpenRLHF have seen remarkable growth. In February, inclusionAI fully open-sourced their RL framework AReaL, designed to train large inference models that anyone can reproduce.
  • The blurring of Technical Boundaries: Vector databases, once standalone, now compete with traditional big data systems (e.g., OceanBase adding vector storage support) while maintaining a delicate ecological equilibrium.

We observed and summarized 7 relatively clear technical trends including emerging paradigms such as Agent Frameworks, AI-native communication protocols like MCP, and Coding Agents at the application layer.

7 Technical Trends

1. The Agent Frameworks Boom Diverged in 2025

From 2023 to 2024, "all-in-one" frameworks like LangChain dominated with their pioneering task orchestration capabilities. A huge number of new Agent development frameworks emerged, many focusing on specific features such as tool calling, RAG integration, long-context memory, or ReAct planning.

By the second half of 2024, only a few new frameworks entered the ecosystem. As the initial hype faded, early market leaders like LangChain gradually declined, weighed down by steep learning curves.

Entering 2025, the market showed signs of divergence: platforms like Dify and RAGFlow became extremely popular by offering low-code workflows and enterprise-grade service deployments. In contrast, development frameworks like LangChain and LlamaIndex have been steadily losing ground.

Dify has accurately captured enterprise user needs — offering intuitive visual workflow orchestration, comprehensive enterprise-grade security, and significantly lowering the technical barrier for amateur users.

2. Standard Protocol Layer: The Strategic Battleground

  • 2022: Wild West Era — ad-hoc prompt engineering for tool interaction.
  • 2023: OpenAI's GPT-4-0613 introduced Function Calling with a standardized API.
  • 2024: Anthropic's Model Context Protocol (MCP), open-sourced November 2024, standardized agent-tool communication. By Q1 2025, MCP became the de facto standard.
  • 2025: Protocol "War" begins:
    • April: Google open-sourced the Agent2Agent (A2A) protocol for communication between multiple agents.
    • May: CopilotKit launched the Agent-User Interaction (AG-UI) protocol with 2.2K GitHub stars in its first week.

The emergence of MCP, A2A, and AG-UI signals LLM applications evolving toward a microservices architecture. The open-source ecosystem will become the battlefield of both standards and their reference designs.
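For concreteness, MCP is built on JSON-RPC 2.0, and tool invocation uses the `tools/call` method. A minimal sketch of such a request (the tool name and arguments here are invented for illustration):

```python
import json

# An MCP-style tool invocation is a JSON-RPC 2.0 request. The "tools/call"
# method comes from the MCP spec; "search_docs" and its arguments are
# hypothetical examples for this sketch.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "search_docs",
        "arguments": {"query": "OpenRank"},
    },
}
wire = json.dumps(request)
```

Standardizing this envelope is what lets any MCP-compatible agent call any MCP-compatible tool server, much as HTTP decoupled web clients from web servers.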

3. The Irresistible Vibe Coding Software Development Paradigm

When Andrej Karpathy coined the term "vibe coding," it captured the defining trend in developer productivity. Our research reveals a clear market pattern:

Major tech companies have rapidly entered AI coding primarily with closed-source offerings: GitHub Copilot, Amazon Q Developer, Huawei's CodeArts Snap, Alibaba's Tongyi Lingma, ByteDance's Trae, and Ant Group's CodeFuse.

Startup ventures and small teams have demonstrated remarkable agility. A prime example is Continue (GitHub: continuedev), which gained substantial attention through lean operations and flexible innovation. The sector's potential was underscored by OpenAI's reported $3 billion acquisition offer for Windsurf.

AI coding tools are advancing beyond basic snippet generation to tackle full-scale development workflows, though substantial challenges remain in semantic validation, multi-language coordination, and security-sensitive code generation.

4. The Shifting Boundaries of Vector Indexing and Storage

The evolution of vector databases can be described as a journey "from explosive hype to rational consolidation." Around February 2023, projects like Qdrant and Chroma saw an unprecedented surge, amassing over 5,000 GitHub stars. However, this initial frenzy failed to sustain long-term momentum.

Several factors contributed to equilibrium:

  1. Closed-source commercial competitors like Pinecone demonstrated strong product capabilities.
  2. Traditional databases (PostgreSQL, MongoDB Atlas, ElasticSearch) introduced vectorization via plugins like pgvector.
  3. The OpenCore model prioritizes ecosystem expansion over community metrics.

Despite these pressures, large-scale enterprise demands for cloud-native scalability and compliance still favor specialized vector databases. Milvus, under neutral LF AI & Data stewardship, has consistently maintained a stable leading position.

5. The Evolution of Multimodal Data Governance in the Age of LLMs

In data lake table formats, Apache Iceberg, Apache Hudi, Apache Paimon, and Delta Lake have formed a four-way oligopoly. Iceberg has solidified its position as the universal framework, while Hudi and Paimon excel at real-time incremental processing.

The metadata governance and data catalog space sees OpenMetadata and DataHub maintaining leadership, with newcomers like Apache Gravitino and Unity Catalog emerging as potential disruptors. These tools are expanding to include unstructured data and AI assets.

6. The Ongoing Horse Racing in Model Serving and Inference

Three critical factors have emerged as core deal-makers or deal-breakers: inference efficiency, resource utilization, and deployment flexibility. The Top 10 ranking list reshuffles constantly, with contenders like Tsinghua University's KTransformers and NVIDIA's Dynamo continually challenging the status quo.

A potential duopoly is forming between vLLM and SGLang, currently the two most prominent inference engines in the LLM space. In Q1 2025, vLLM's OpenRank grew 17%, while SGLang's surged 31%.

This duel carries notable academic pedigree: UC Berkeley, birthplace of Spark and Ray, again demonstrates its open-source alchemy. vLLM originated from Berkeley's SkyLab; SGLang from LMSYS, the multi-university research consortium that created Chatbot Arena.

Other notable engines:

  • Ollama & llama.cpp: The lightweight powerhouses for edge inference and on-premise deployment
  • KTransformers: Enabled running full 671B parameter models (DeepSeek-R1/V3) on consumer hardware with 3–28x speedups, triggering a 34x OpenRank spike

7. The PyTorch-Centric Training Ecosystem

PyTorch has undeniably become the dominant force and de facto standard in LLM development. Its modular, lightweight design propelled it past TensorFlow in 2020, while TensorFlow, MXNet, and Caffe faded into obsolescence.

In September 2022, Meta transferred PyTorch's governance to the Linux Foundation, establishing the PyTorch Foundation. Propelled by PyTorch's nearly overwhelming ecosystem gravity, this sub-foundation has grown into a powerful umbrella organization:

  • March 2025: Inference engine SGLang joined the PyTorch ecosystem
  • May 2025: vLLM and distributed training platform DeepSpeed joined the PyTorch Foundation

Community data still reveals Meta's substantial behind-the-scenes influence: the repository's top contributors are all identifiable Meta staff, and over 9,000 pull requests (9% of all PRs) carry the "fb-exported" label.

Conclusion

Ant Group's Open Source team initiated this landscape project to understand the full picture of the LLM development ecosystem, including emerging trends and cutting-edge popular projects. One of our missions is to leverage insights from the open-source community to guide Ant Group's architectural and technological decisions.

This report reflects Ant Group's perspective as a technology enterprise, utilizing X-lab's OpenRank evaluation metrics alongside extensive consultations with technical experts and open-source community developers.

Full Author List: Xiaoya Xia, Sikang Bian, Chao Dong, Xu Wang (AntOSS) Shengyu Zhao, Fanyu Han, Jiaheng Peng, Zhen Zhang, Wei Wang (X-lab)

More on GitHub: https://github.com/antgroup/llm-oss-landscape

In RL we trust — AReaL v0.2 (Boba) Release

· 6 min read
inclusionAI
Ant Group

Originally published on Medium by Ant Open Source.

AReaL v0.2 Boba

We are excited to release AReaL v0.2 (Boba), featuring three major milestones:

  • SGLang Support: With the addition of SGLang support and a series of engineering optimizations, AReaL v0.2 achieves a speed improvement of 1.5x over AReaL v0.1 on 7B models.
  • SOTA 7B Model: AReaL's RL training becomes more stable and sample-efficient. We obtain a SOTA 7B model in mathematical reasoning, achieving pass@1 scores of 61.9 on AIME 2024 and 48.3 on AIME 2025.
  • Competitive 32B Model: The highly competitive 32B model was trained with extremely low cost, achieving results comparable to QwQ-32B using only 200 data samples.

Performance comparison table

The table shows performance of AReaL-boba-RL-7B and AReaL-boba-SFT-32B. Note that we obtain SOTA 7B model using RL on math reasoning. We also train a highly competitive 32B model using only 200 data samples, replicating QwQ-32B's inference performance on AIME 2024.

Training Speed Comparison

AReaL-boba throughput comparison with v0.1.0


AReaL v0.2.0 features the following system optimizations:

Upgraded Generation Backend: vLLM 0.6.3 → SGLang v0.4.0

The generation backend has been upgraded leveraging SGLang's radix attention mechanism to significantly improve throughput in scenarios where multiple responses are sampled from the same prompt. SGLang automatically flushes radix caches upon weight updates, ensuring correctness in on-policy RL.

Optimized Training for Variable-Length Sequences & Large Batches

To handle variable sequence lengths efficiently, we eliminate padding and pack sequences into 1D tensors instead. A dynamic allocation algorithm optimally distributes sequences under a maximum token budget, balancing micro-batch sizes while minimizing the number of micro-batches. This approach maximizes GPU memory utilization.
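The allocation step can be approximated with a first-fit-decreasing heuristic. This is a simplified sketch, not AReaL's actual allocator, which additionally balances micro-batch sizes while minimizing their count:

```python
# First-fit-decreasing packing of variable-length sequences into
# micro-batches under a maximum token budget.
def pack(seq_lens, budget):
    batches = []  # each entry: [total_tokens, [sequence lengths]]
    for length in sorted(seq_lens, reverse=True):
        for batch in batches:
            # Place the sequence in the first micro-batch with room left.
            if batch[0] + length <= budget:
                batch[0] += length
                batch[1].append(length)
                break
        else:
            # No existing micro-batch fits; open a new one.
            batches.append([length, [length]])
    return [lens for _, lens in batches]

micro_batches = pack([900, 120, 400, 650, 300, 80], budget=1024)
```

Packing sequences this way avoids padding waste: every micro-batch stays near the token budget, so GPU memory utilization is driven by real tokens rather than pad tokens.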

High-Performance Data Transfer for 1K-GPU Scaling

AReaL employs NCCL with GPU-Direct RDMA (GDRDMA) over InfiniBand/RoCE, enabling direct GPU-to-GPU communication that bypasses costly CPU-mediated transfers and PCIe bottlenecks. This keeps generation-to-training data transfer overhead below 3 seconds even in a large 1,000-GPU cluster.

Training Recipe

SOTA 7B model using RL on math reasoning

Base Model

We use R1-Distill-Qwen-7B as our foundation model.

Dataset Curation

Our training dataset (AReaL-boba-106k) combines resources from multiple open-source projects.

We enhanced this with challenging problems from NuminaMath (AoPS/Olympiad subsets) and ZebraLogic.

To maintain an appropriate difficulty level, overly simple questions were filtered out. Specifically, we generate 8 solutions per question using DeepSeek-R1-Distill-Qwen-7B and filter out questions where all solutions were correct.
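That filtering rule is easy to express directly: drop any question the model answers correctly in every one of its sampled solutions. A minimal sketch with toy data:

```python
# Keep only questions that the model fails on at least once across its
# sampled solutions (the difficulty filter described above).
def filter_easy(questions, correctness):
    # correctness[q] is one boolean per sampled solution (8 in the paper).
    return [q for q in questions if not all(correctness[q])]

questions = ["q1", "q2", "q3"]
correctness = {
    "q1": [True] * 8,         # solved every time -> too easy, dropped
    "q2": [True, False] * 4,  # sometimes wrong -> kept
    "q3": [False] * 8,        # never solved -> kept
}
kept = filter_easy(questions, correctness)
```

Questions the model always solves provide no learning signal under RL, so removing them concentrates training on the frontier of the model's ability.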

Reward Function

We adopt a sparse sequence-level reward mechanism. The model is instructed to enclose the final answer within \boxed{}, and the boxed answer is then verified. Correct responses receive a reward of +5, while incorrect ones are penalized with -5.
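A minimal sketch of this reward, assuming the answer can be read out of `\boxed{}` with a regex (a simplification that ignores nested braces):

```python
import re

# Sparse sequence-level reward: +5 if the \boxed{} answer matches the
# reference, -5 otherwise. The regex assumes no nested braces inside
# \boxed{}, which real verifiers handle more carefully.
def boxed_reward(response, reference):
    match = re.search(r"\\boxed\{([^{}]*)\}", response)
    answer = match.group(1).strip() if match else None
    return 5.0 if answer == reference else -5.0

r_correct = boxed_reward(r"... so the answer is \boxed{42}.", "42")
r_missing = boxed_reward("no final answer given", "42")
```

A response with no `\boxed{}` at all is penalized the same as a wrong answer, which also pushes the model to follow the output format.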

Notably, we observe that the KL reward can impair performance, particularly in long chain-of-thought training, so we set it to zero.

RL Algorithm

We employ Proximal Policy Optimization (PPO) as our training algorithm and remove the critic model to save compute. We set both the discount factor γ and the GAE parameter λ to 1. Such practices are also adopted by the Open-Reasoner-Zero project.

Token-Level Loss Normalization

Averaging the loss at the sequence level can underweight the overall contribution of longer texts. To address this, we normalize the loss at the token level, as also highlighted in DAPO.

Rollout Strategy

During the rollout phase, we sample 512 questions per batch, and the LLM generates 16 responses per question — resulting in a total batch size of 8,192. To minimize output truncation, we set the maximum generation length to 27K tokens. In our experiment, the truncation rate remained below 5%.

Key Hyperparameters

Key hyperparameters table

This configuration balances convergence speed with training stability.

Approaching QwQ-32B's performance using only 200 data samples

For the 32B model size, we further refined the training data and released AReaL-boba-SFT-200, a high-quality dataset of only 200 data points. Together with the accompanying training scripts, it allowed us to replicate QwQ-32B's inference performance on AIME 2024 via Supervised Fine-Tuning (SFT).

Evaluation Best Practices

During evaluation, we use vLLM v0.6.3 as the generation framework. We recommend manually configuring the following options:

enforce_eager=True
enable_chunked_prefill=False
disable_custom_all_reduce=True
disable_sliding_window=True
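Collected as keyword arguments for vLLM's `LLM` constructor, the options above might be wired up as follows. This is a sketch: the option names match vLLM v0.6.x, but verify them against your installed version, and the model name is a placeholder.

```python
# The recommended engine options, gathered for vLLM's LLM constructor.
engine_kwargs = {
    "enforce_eager": True,             # skip CUDA graph capture
    "enable_chunked_prefill": False,
    "disable_custom_all_reduce": True,
    "disable_sliding_window": True,
}

# Usage (requires vllm installed; model path is a placeholder):
# from vllm import LLM
# llm = LLM(model="path/to/AReaL-boba-RL-7B", **engine_kwargs)
```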

Following the practice of DeepSeek models, we incorporate a directive in the prompt: "Please reason step by step, and enclose your final answer in \boxed{}." To encourage long context reasoning, we also enforce that the model begins each response with \n.

To ensure reliable pass@1 estimation, we:

  • Sample 32 answers per problem
  • Use temperature=0.6 and top_p=0.95 for SFT models
  • Maintain training temperature (1.0) for RL models
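The resulting pass@1 estimator is simply the fraction of correct samples per problem, averaged across problems. A sketch with illustrative data:

```python
# pass@1 estimated from many samples per problem: per-problem accuracy,
# then the mean across problems (data below is illustrative).
def pass_at_1(samples_correct):
    # samples_correct: one list of booleans per problem (32 samples each
    # in the protocol above).
    per_problem = [sum(s) / len(s) for s in samples_correct]
    return sum(per_problem) / len(per_problem)

score = pass_at_1([
    [True] * 24 + [False] * 8,    # problem 1: 24/32 correct
    [True] * 16 + [False] * 16,   # problem 2: 16/32 correct
])
```

Averaging over 32 samples per problem reduces the variance of the estimate compared with a single greedy decode, which matters at the nonzero temperatures recommended above.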

Conclusion & Future Work

Our results demonstrate that high-quality data is as critical as algorithmic innovation. When conducting RL training on a powerful base model, more challenging problems are required to facilitate learning. A straightforward data-filtering strategy is to remove problems that the base model consistently solves correctly across multiple sampling attempts.

AReaL delivers stable and fast training with cutting-edge model performances. Since initial release, we've continuously improved system efficiency, training stability, and accessibility.

Looking ahead, the AReaL team will:

  • Further optimize system performance
  • Introduce new features
  • Continue open-sourcing training data
  • Expand to broader reasoning tasks

We believe these contributions lower the barrier for high-quality RL training while pushing the boundaries of reasoning capabilities. We welcome community feedback and collaboration to drive further progress.

PromptCoT & PromptCoT-Mamba: Advancing the Frontiers of Reasoning

· 5 min read
inclusionAI
Ant Group

News

  • May 30, 2025: PromptCoT-Mamba released! Introducing an attention-free foundation model for reasoning tasks.
  • Apr 11, 2025: PromptCoT-QwQ-32B model and its training data released, achieving new state-of-the-art results.
  • Mar 7, 2025: PromptCoT project launched, including the problem generation model, distilled models (PromptCoT-DS series), and associated datasets.

Overview

This repository unifies two synergistic projects aimed at advancing the frontiers of mathematical and code reasoning in Large Language Models (LLMs): PromptCoT and PromptCoT-Mamba.

PromptCoT (Synthesizing Olympiad-Level Problems for Mathematical Reasoning in Large Language Models) addresses the critical challenge of acquiring high-quality, complex problems for training advanced LLMs. It introduces a novel methodology to systematically generate Olympiad-level mathematical problems by modeling the rationale behind expert problem design. This approach not only enhances problem diversity and difficulty but also ensures logical consistency in problem construction, providing a scalable solution for creating robust training datasets.

PromptCoT-Mamba (Scaling Reasoning without Attention) leverages the problem generation capabilities of the PromptCoT pipeline to train PromptCoT-Mamba-7B, the first attention-free foundation model based on the Mamba-2 architecture. This model demonstrates that structured training curricula can enable attention-free models to surpass strong Transformer baselines on a wide array of competition-level math and code reasoning tasks, all while maintaining constant-memory inference without KV caching.

Together, these projects offer a powerful suite of tools, models, and datasets for researchers and developers working on the cutting edge of AI reasoning.


Highlights & Key Results

1. PromptCoT: Problem Generation & Distilled Models

  • ✨ The Missing Piece for Test-Time Scaling: A lightweight yet powerful problem generation model enabling the construction of prompt sets at any scale with sufficient quality, perfect for SFT or RL post-training.
  • 📖 A Fully Open Project: All models (generation, distilled LLMs) and datasets (generation inputs, SFT data) are open-sourced.
  • 🏆 Superior Performance of Distilled Models:
    • PromptCoT-DS-7B consistently surpasses its base model, DeepSeek-R1-Distill-Qwen-7B, with significant gains:
      • +0.9% on MATH-500 (93.7%)
      • +3.2% on AIME2024 (58.7%)
      • +9.2% on AIME2025 (49.2%)
    • PromptCoT-DS-7B (7B parameters) achieves results comparable to larger 32B models like S1-32B and LIMO-32B.
    • PromptCoT-QwQ-32B sets a new standard, outperforming other 32B models by a significant margin:
      • MATH-500: 96.7% ± 0.5%
      • AIME2024: 83.8% ± 2.8%
      • AIME2025: 75.4% ± 4.7%
    • PromptCoT-DS-1.5B demonstrates competitive performance against RL-based models purely through distillation.
  • ⚡ Efficiency Without Compromise: PromptCoT-DS-1.5B achieves 40%+ AIME scores using over 15× fewer A100 GPU hours than models like DeepScaleR-1.5B-Preview.

2. PromptCoT-Mamba: Attention-Free Reasoning

  • 🚀 First Attention-Free SOTA: PromptCoT-Mamba-7B is the first attention-free model (Mamba-2 architecture) to outperform strong Transformer baselines in math and code reasoning.
  • 🧠 Trained with PromptCoT Pipeline: Utilizes a structured, two-stage curriculum with data generated by PromptCoT.
  • 💪 Strong General Performance: PromptCoT-Mamba-7B consistently outperforms 7B-scale Transformer and hybrid Mamba-Transformer baselines.
    • MATH-500: 84.6%
    • AIME 2024: 35.2%
    • AIME 2025: 24.6%
    • Livecodebench: 29.9%
  • 🎯 Math Specialization: The math-specialized variant, PromptCoT-Mamba-Math-7B, further boosts math performance:
    • MATH-500: 88.0%
    • AIME 2024: 42.9% (+7.7% over generalist)
    • AIME 2025: 30.8% (+6.2% over generalist)
  • Inference Efficiency: Offers substantial speedups (e.g., 3.66× faster on 24GB GPU for long sequences) and constant-memory inference, ideal for cost-sensitive or long-context workloads.

Performance Details

PromptCoT Series Performance

| Model | GSM8K | MATH-500 | AIME2024 | AIME2025 |
| --- | --- | --- | --- | --- |
| 🔹 1.5B Models | | | | |
| DeepSeek-R1-Distill-Qwen-1.5B | - | 83.9% | 28.9% | 28.1% |
| STILL-3-1.5B-preview | - | 85.5% | 39.3% | - |
| DeepScaleR-1.5B-Preview | - | 🟢 87.8% | 🟢 43.1% | 🟢 37.1% |
| PromptCoT-DS-1.5B (ours) | 🟢 87.6% ± 0.5% | 85.3% ± 1.1% | 41.2% ± 6.9% | 36.7% ± 6.2% |
| 🔹 7B Models | | | | |
| DeepSeek-R1-Distill-Qwen-7B | - | 92.8% | 55.5% | 40.0% |
| Qwen2.5-7B-SimpleRL | - | 82.4% | 26.7% | - |
| OpenThinker-7B | - | 89.6% | 30.0% | 33.3% |
| OpenR1-Qwen-7B | - | 90.6% | 36.7% | 40.0% |
| PromptCoT-DS-7B (ours) | 🔥 92.8% ± 0.5% | 🔥 93.7% ± 0.7% | 🔥 58.7% ± 3.1% | 🔥 49.2% ± 7.9% |
| 🔹 32B Models | | | | |
| DeepSeek-R1-Distill-Qwen-32B | - | 94.3% | 72.6% | - |
| S1-32B | - | 93.0% | 56.7% | 26.6% |
| LIMO-32B | - | 94.8% | 57.1% | 46.6% |
| QwQ-32B | - | - | 82.1% | 70.8% |
| PromptCoT-QwQ-32B (ours) | 🔥🔥 96.4% ± 0.2% | 🔥🔥 96.7% ± 0.5% | 🔥🔥 83.8% ± 2.8% | 🔥🔥 75.4% ± 4.7% |

PromptCoT-Mamba Performance

General Performance:

| Model | MATH-500 | AIME 24 | AIME 25 | OlympiadBench | HumanEval | HumanEval+ | Livecodebench |
| --- | --- | --- | --- | --- | --- | --- | --- |
| PromptCoT-Mamba-7B | 84.6 | 🔥🔥 35.2 | 🔥🔥 24.6 | 50.7 | 81.7 | 75.0 | 🔥🔥 29.9 |
| Gemma3-27B | 89.0 | 32.6 | 24.0 | 54.2 | 86.0 | 78.0 | 26.9 |
| Gemma3-12B | 83.8 | 22.9 | 19.2 | 49.9 | 81.1 | 73.2 | 22.2 |
| Sky-T1-7B | 85.0 | 19.2 | 19.2 | 49.2 | 41.5 | 37.2 | 18.3 |
| S1.1-7B | 82.0 | 19.2 | 17.5 | 43.1 | 64.0 | 56.7 | 13.3 |
| Bespoke-Stratos-7B | 81.2 | 18.3 | 16.3 | 45.0 | 73.2 | 68.3 | 8.6 |
| Nemotron-H-8B | 77.6 | - | - | - | 79.3 | 74.4 | - |
| M1-3B | 81.7 | 23.0 | 22.0 | 43.6 | - | - | - |

Math Specialization vs. Generalist:

| Model | MATH-500 | AIME 24 | AIME 25 | OlympiadBench | HumanEval | HumanEval+ | Livecodebench |
| --- | --- | --- | --- | --- | --- | --- | --- |
| PromptCoT-Mamba-Math-7B | 🔥🔥 88.0 | 🔥🔥 42.9 | 🔥🔥 30.8 | 🔥🔥 52.1 | 71.3 | 66.5 | 20.3 |
| PromptCoT-Mamba-7B | 84.6 | 35.2 | 24.6 | 50.7 | 81.7 | 75.0 | 29.9 |

Citation

If you find PromptCoT or PromptCoT-Mamba useful in your research, please consider citing the respective papers:

For PromptCoT:

@article{zhao2025promptcot,
  author  = {Zhao, Xueliang and Wu, Wei and Guan, Jian and Kong, Lingpeng},
  title   = {PromptCoT: Synthesizing Olympiad-Level Problems for Mathematical Reasoning in Large Language Models},
  journal = {arXiv preprint arXiv:2503.02324},
  year    = {2025},
  url     = {http://arxiv.org/abs/2503.02324}
}

For PromptCoT-Mamba:

@article{zhao2025scaling,
  author  = {Zhao, Xueliang and Wu, Wei and Kong, Lingpeng},
  title   = {Scaling Reasoning without Attention},
  journal = {arXiv preprint arXiv:2505.22425},
  year    = {2025},
  url     = {https://arxiv.org/abs/2505.22425}
}