Xiaomi MiMo-V2.5 Series API Gets Permanent Price Cut: Up to 99% Off

Xiaomi slashes MiMo-V2.5 series API prices by up to 99% and no longer differentiates pricing by context window length. The new pricing, effective May 27, brings MiMo-V2.5 Pro input cache hits down to just 0.025 yuan per million tokens.

Today, Xiaomi announced a permanent reduction in API pricing for its MiMo-V2.5 series of large language models. Compared with the original rates, the new prices are up to 99% lower, and pricing no longer varies with context window length.

The adjustment took effect globally at 00:00 Beijing time on May 27, 2026.

New Pricing Details

The price cut covers both MiMo-V2.5 and MiMo-V2.5 Pro editions:

MiMo-V2.5 Pro

Input cache hit: 0.025 yuan / million tokens (up to 99% off)
Output: 6 yuan / million tokens (up to 86% off)

MiMo-V2.5

Input cache hit: 0.02 yuan / million tokens (up to 98% off)
Output: 2 yuan / million tokens (up to 93% off)
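
To make the per-million-token figures concrete, here is a minimal sketch in Python that estimates a bill from the two listed rates. The function name and the token counts are invented for illustration. The rate for input tokens that miss the cache is not listed above, so the sketch leaves it out rather than guessing at it.

```python
from decimal import Decimal

# Listed prices in yuan per million tokens, taken from the figures above.
# The rate for input tokens that miss the cache is not listed, so it is
# intentionally absent from this table.
PRICES = {
    "MiMo-V2.5 Pro": {"input_cache_hit": Decimal("0.025"), "output": Decimal("6")},
    "MiMo-V2.5": {"input_cache_hit": Decimal("0.02"), "output": Decimal("2")},
}

MILLION = Decimal(1_000_000)


def estimate_cost(model: str, cached_input_tokens: int, output_tokens: int) -> Decimal:
    """Return the cost in yuan for cache-hit input tokens plus output tokens."""
    rates = PRICES[model]
    input_cost = rates["input_cache_hit"] * cached_input_tokens / MILLION
    output_cost = rates["output"] * output_tokens / MILLION
    return input_cost + output_cost


# Hypothetical workload: 40M cache-hit input tokens and 2M output tokens.
for name in PRICES:
    print(name, estimate_cost(name, 40_000_000, 2_000_000), "yuan")
```

Decimal is used instead of float so that small rates such as 0.025 are represented exactly. A real bill would also include cache-miss input tokens at whatever rate the provider publishes.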

Token Plan Revamp

Beyond the API price cuts, the Token Plan billing system has also been significantly overhauled. Usage quotas have been raised to 5-8x the original level at no additional cost, and a new Credits concept has been introduced to make billing rules clearer and easier to understand.

Technical Optimizations

Xiaomi attributed the price cut to ongoing optimizations of its inference system. The team implemented full Sliding Window Attention (SWA) support on top of SGLang HiCache, which cut KV cache data movement between GPU memory, CPU memory, and SSD to nearly one-seventh of previous levels and increased the number of cacheable tokens nearly fivefold, significantly improving cache hit rates and inference efficiency.
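
As a rough illustration of why sliding-window layers can raise the number of cacheable tokens, the sketch below compares the KV cache footprint of a model whose layers all use full attention with one where most layers use SWA. Every number in it (layer counts, head sizes, window length) is a hypothetical placeholder, not a MiMo-V2.5 specification. It is not meant to reproduce the reported figures, and it models only cache size, not the HiCache data movement between memory tiers.

```python
def kv_bytes_per_token_per_layer(num_kv_heads: int, head_dim: int, dtype_bytes: int = 2) -> int:
    """Bytes needed to store one token's key and value vectors in one layer."""
    return 2 * num_kv_heads * head_dim * dtype_bytes


def kv_cache_bytes(seq_len: int, full_layers: int, swa_layers: int,
                   window: int, per_token_layer_bytes: int) -> int:
    """Total KV cache bytes for one sequence.

    Full-attention layers keep every token; sliding-window layers only
    need the most recent `window` tokens.
    """
    full = full_layers * seq_len * per_token_layer_bytes
    swa = swa_layers * min(seq_len, window) * per_token_layer_bytes
    return full + swa


# Hypothetical configuration, chosen only to make the arithmetic readable.
unit = kv_bytes_per_token_per_layer(num_kv_heads=8, head_dim=128)  # fp16 values
seq_len = 65_536

all_full = kv_cache_bytes(seq_len, full_layers=48, swa_layers=0,
                          window=4_096, per_token_layer_bytes=unit)
hybrid = kv_cache_bytes(seq_len, full_layers=8, swa_layers=40,
                        window=4_096, per_token_layer_bytes=unit)

print(f"all full attention: {all_full / 2**30:.2f} GiB")
print(f"hybrid with SWA:    {hybrid / 2**30:.2f} GiB")
print(f"sequences that fit in the same memory: {all_full / hybrid:.2f}x")
```

The point of the sketch is only the shape of the formula: once a sequence is longer than the window, each SWA layer's cache stops growing, so the same memory budget holds more tokens.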

Additionally, Xiaomi optimized its expert parallelism scheme and input-length bucketing strategy to further improve cluster throughput, continuously lowering per-token service costs while maintaining quality.
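
The announcement does not describe how the bucketing strategy works. As a generic illustration of the idea, and not Xiaomi's actual scheme, the sketch below assigns each request to the smallest length bucket that fits it, so that requests of similar size can be scheduled together. The bucket bounds and the padding-based cost model are assumptions made for the example.

```python
from bisect import bisect_left
from collections import defaultdict

# Hypothetical bucket upper bounds, in tokens.
BUCKET_BOUNDS = [512, 2048, 8192, 32768]


def bucket_for(length: int) -> int:
    """Return the smallest bucket bound that can hold `length` tokens."""
    i = bisect_left(BUCKET_BOUNDS, length)
    if i == len(BUCKET_BOUNDS):
        raise ValueError(f"input of {length} tokens exceeds the largest bucket")
    return BUCKET_BOUNDS[i]


def group_by_bucket(lengths):
    """Map each bucket bound to the list of input lengths assigned to it."""
    groups = defaultdict(list)
    for n in lengths:
        groups[bucket_for(n)].append(n)
    return dict(groups)


def padded_tokens(groups):
    """Total tokens processed if every input is padded to its bucket bound."""
    return sum(bound * len(members) for bound, members in groups.items())


# Hypothetical request lengths.
lengths = [100, 400, 1500, 7000, 30000]
groups = group_by_bucket(lengths)

bucketed = padded_tokens(groups)
single_batch = max(lengths) * len(lengths)  # everything padded to the longest input

print(groups)
print("padded tokens with buckets:", bucketed)
print("padded tokens in one batch:", single_batch)
```

Under this simplified cost model, grouping by length avoids padding short requests up to the longest one in the batch. Production serving systems may bucket for other reasons, such as routing requests to differently configured instances.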

Context

This price cut places MiMo-V2.5 Pro in the lowest pricing tier among major Chinese LLM providers, making it highly competitive for high-frequency API workloads such as chatbots, content generation, and customer service automation.