Source: http://www.linkedin.com/top-content/productivity/performance-optimization-techniques/quantization-techniques-for-large-scale-data-processing/

Quantization Techniques for Large-Scale Data Processing


Summary

Quantization techniques for large-scale data processing are methods that reduce the size and complexity of data by converting high-precision values into lower-precision representations, making storage and computation more efficient without sacrificing much accuracy. These approaches are especially important for handling massive datasets and deploying machine learning models in environments with limited resources.

- Choose smart compression: Selecting the right quantization method, such as product or rotational quantization, can deliver significant storage savings and speed up search operations while keeping accuracy high.
- Match precision to needs: Adjusting the bit-width of data representations lets you balance resource use and performance, making it easier to deploy models across different hardware setups.
- Streamline deployment: Using flexible quantization strategies, like group-wise or nested models, allows you to maintain strong search and retrieval quality even when working with strict memory or computing constraints.

Summarized by AI from LinkedIn member posts.

Philip Vollet

VP Developer Relations & Growth @ Weaviate - Building memory

135,471 followers · 6 months ago

4× compression. 42% faster queries. BETTER accuracy.

Sounds impossible, right? Compression is supposed to be a tradeoff, BUT 8-bit Rotational Quantization (RQ) breaks the rules. What makes this absolutely wild:

The typical compression story: You compress vectors to save memory, you lose some quality, maybe you gain some speed. Pick your poison.

The RQ story:
- 4x compression
- 2.3x faster distance computation
- 98-99% recall (basically perfect)
- 15-50% higher throughput

MIND BLOWN

How does this even work? Two steps:
- Fast pseudorandom rotation using the Walsh-Hadamard Transform, which redistributes your vector's information evenly across all dimensions (this is the clever part that makes everything else possible)
- Scalar quantization to 8-bit integers, with the quantization interval defined by each vector's min/max values

The rotation is doing something beautiful here: it's making EVERY vector universally well-suited for quantization regardless of structure or position in your dataset. No training needed, no cluster centers, just pure mathematical elegance.

We went DEEP on the technical implementation in the full blog post. The engineering to make this production-ready is genuinely impressive 💙

This is now available in Weaviate 1.32+ and we see it as a better default than uncompressed vectors for most use cases.

Blog: https://lnkd.in/e7aBbaeG
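
The two steps described in the post (a pseudorandom rotation, then per-vector 8-bit scalar quantization) can be sketched in a few lines of NumPy. This is only an illustration under assumptions, not Weaviate's implementation: the function names are hypothetical, the rotation is a single random sign flip followed by an orthonormal Walsh-Hadamard transform, and the vector length is assumed to be a power of two.

```python
import numpy as np


def fwht(x: np.ndarray) -> np.ndarray:
    """Orthonormal fast Walsh-Hadamard transform (length must be a power of two)."""
    x = np.asarray(x, dtype=np.float64).copy()
    n = x.shape[0]
    assert n > 0 and (n & (n - 1)) == 0, "length must be a power of two"
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            a = x[i:i + h].copy()
            b = x[i + h:i + 2 * h].copy()
            x[i:i + h] = a + b          # butterfly: sums
            x[i + h:i + 2 * h] = a - b  # butterfly: differences
        h *= 2
    return x / np.sqrt(n)  # scaling keeps the transform norm-preserving


def rotate(x: np.ndarray, signs: np.ndarray) -> np.ndarray:
    """Pseudorandom rotation: random sign flips followed by the transform."""
    return fwht(x * signs)


def quantize_u8(x: np.ndarray):
    """Scalar-quantize to 8-bit codes using this vector's own min/max."""
    lo, hi = float(x.min()), float(x.max())
    scale = (hi - lo) / 255.0 or 1.0  # avoid a zero step for constant vectors
    codes = np.round((x - lo) / scale).astype(np.uint8)
    return codes, lo, scale


def dequantize_u8(codes: np.ndarray, lo: float, scale: float) -> np.ndarray:
    return codes.astype(np.float64) * scale + lo


# Demo on a "spiky" vector: one coordinate dominates the min/max range.
rng = np.random.default_rng(0)
x = rng.standard_normal(256)
x[7] = 100.0
signs = rng.choice([-1.0, 1.0], size=256)
r = rotate(x, signs)

err_raw = np.linalg.norm(dequantize_u8(*quantize_u8(x)) - x)
err_rot = np.linalg.norm(dequantize_u8(*quantize_u8(r)) - r)
print("error without rotation:", err_raw)
print("error with rotation:   ", err_rot)
```

Because the rotation is orthonormal, norms and inner products are unchanged, so distances can be computed in the rotated space. The demo vector is deliberately spiky so that the effect of spreading information across dimensions shows up in the quantization error.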

Krishna Konar

Data Engineer @ Meta Reality Labs | ML Data Infrastructure and Distributed Systems

5,966 followers · 8 months ago

How to compress 2.88 TB of vector embeddings down to 97 GB

I just wrote a detailed article on Product Quantization (PQ), the technique powering efficient similarity search in modern vector databases.

The challenge: Storing 1 billion 768-dimensional embeddings requires 2.88 TB of memory. That's expensive and impractical for most applications.

1,000,000,000 vectors × 768 dimensions × 4 bytes = 3,072,000,000,000 bytes

The solution: Product Quantization achieves 30x compression while maintaining good search accuracy (Recall@10), which depends on how the sub-vectors are chosen.

How it works:
- Split vectors into sub-vectors (e.g., 96 sub-vectors of 8 dimensions each)
- Learn a codebook of centroids for each sub-vector using K-means
- Replace each sub-vector with a 1-byte code pointing to its nearest centroid
- Search using fast table lookups instead of full distance calculations

Real-world impact: With 96 sub-vectors and 256 centroids, 1 billion vectors → just ~97 GB

This article walks through a complete step-by-step example with actual numbers, showing exactly how the encoding and search process works.
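
The four steps above can be sketched with plain NumPy. This is a minimal illustration under assumptions, not the article's code: all function names are hypothetical, K-means is a few plain Lloyd iterations, and the demo uses a tiny dataset. The two helper functions at the top redo the storage arithmetic: by the post's own formula the raw size is 3,072,000,000,000 bytes (about 3.07 TB in decimal units) and the 1-byte codes take 96,000,000,000 bytes (a 32x ratio), which is close to the rounded figures quoted in the post.

```python
import numpy as np


def raw_bytes(n: int, dim: int, bytes_per_value: int = 4) -> int:
    return n * dim * bytes_per_value


def pq_code_bytes(n: int, m: int) -> int:
    return n * m  # one 1-byte code per sub-vector (up to 256 centroids)


def train_codebooks(X, m, k=256, iters=10, seed=0):
    """Learn one k-centroid codebook per sub-vector with plain Lloyd iterations."""
    rng = np.random.default_rng(seed)
    n, dim = X.shape
    d = dim // m
    books = np.empty((m, k, d))
    for j in range(m):
        sub = X[:, j * d:(j + 1) * d]
        cent = sub[rng.choice(n, size=k, replace=False)].copy()
        for _ in range(iters):
            dist = ((sub[:, None, :] - cent[None, :, :]) ** 2).sum(-1)
            assign = dist.argmin(1)
            for c in range(k):
                members = sub[assign == c]
                if len(members):
                    cent[c] = members.mean(0)
        books[j] = cent
    return books


def encode(X, books):
    """Replace each sub-vector with the index of its nearest centroid."""
    m, k, d = books.shape
    codes = np.empty((X.shape[0], m), dtype=np.uint8)
    for j in range(m):
        sub = X[:, j * d:(j + 1) * d]
        codes[:, j] = ((sub[:, None, :] - books[j][None]) ** 2).sum(-1).argmin(1)
    return codes


def decode(codes, books):
    """Reconstruct approximate vectors from codes."""
    return np.hstack([books[j][codes[:, j]] for j in range(books.shape[0])])


def adc_distances(q, codes, books):
    """Asymmetric distances via one lookup table per sub-vector."""
    m, k, d = books.shape
    # table[j, c] = squared distance from query sub-vector j to centroid c
    table = np.stack([((books[j] - q[j * d:(j + 1) * d]) ** 2).sum(-1) for j in range(m)])
    return table[np.arange(m), codes].sum(1)


print(raw_bytes(1_000_000_000, 768), pq_code_bytes(1_000_000_000, 96))

# Tiny demo: 2,000 vectors of 32 dimensions, 4 sub-vectors, 32 centroids each.
rng = np.random.default_rng(0)
X = rng.standard_normal((2000, 32))
books = train_codebooks(X, m=4, k=32, iters=5)
codes = encode(X, books)
top10 = np.argsort(adc_distances(X[0], codes, books))[:10]
print(codes.shape, top10)
```

The table lookup gives exactly the squared distance between the query and the reconstructed (decoded) vector, which is why search never needs to decompress the database.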

Article: Understanding Product Quantization: A Step-by-Step Guide, by Krishna Konar, published on LinkedIn
Rahul Agarwal

Staff ML Engineer | Meta, Roku, Walmart | 1:1 @ topmate.io/MLwhiz

45,948 followers · 3 months ago

A colleague asked me to explain Product Quantization last week. Twenty minutes later, we were both completely lost.

I knew IVF-PQ worked - I had used FAISS dozens of times. But explaining it from scratch? That is where things got messy. So I went back and actually figured it out. My latest post is the explanation I wish I had.

Here is what we cover:
- Why Brute Force search falls apart at 100M+ vectors
- How IVF partitions space so you are not searching everything
- How Product Quantization (PQ) compresses vectors by 64x
- Why IVF-PQ combined is what most production search engines use
- Metadata filtering for real-world constraints
- FAISS vs Milvus vs Pinecone - which stack to pick and when
- How to benchmark recall, QPS, latency, and memory

Part 5 of my RecSys for MLEs series - a deep dive into the vector infra that makes retrieval work at scale.

https://lnkd.in/ehA_WRFk

Have you ever had to explain IVF-PQ to someone from scratch? How did it go?

#MachineLearning #RecommenderSystems #MLEngineering #DataScience
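
The IVF idea from the list above (partition the space, then search only a few partitions) and the recall measurement can be sketched as follows. This is a simplified NumPy illustration, not FAISS and not the linked article's code: the function names are hypothetical, and the inverted lists hold row indices into the full-precision matrix, whereas IVF-PQ would store PQ codes in each list instead.

```python
import numpy as np


def kmeans(X, k, iters=10, seed=0):
    """Plain Lloyd iterations; returns centroids and final assignments."""
    rng = np.random.default_rng(seed)
    cent = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        assign = ((X[:, None] - cent[None]) ** 2).sum(-1).argmin(1)
        for c in range(k):
            if np.any(assign == c):
                cent[c] = X[assign == c].mean(0)
    assign = ((X[:, None] - cent[None]) ** 2).sum(-1).argmin(1)
    return cent, assign


def build_ivf(X, nlist):
    """One inverted list of row indices per coarse centroid."""
    cent, assign = kmeans(X, nlist)
    lists = [np.flatnonzero(assign == c) for c in range(nlist)]
    return cent, lists


def ivf_search(q, X, cent, lists, nprobe, topk):
    """Scan only the nprobe lists whose centroids are closest to the query."""
    probe = np.argsort(((cent - q) ** 2).sum(1))[:nprobe]
    cand = np.concatenate([lists[c] for c in probe])
    d = ((X[cand] - q) ** 2).sum(1)
    return cand[np.argsort(d)[:topk]]


def brute_force(q, X, topk):
    return np.argsort(((X - q) ** 2).sum(1))[:topk]


def recall_at_k(found, truth):
    return len(set(found.tolist()) & set(truth.tolist())) / len(truth)


rng = np.random.default_rng(0)
X = rng.standard_normal((5000, 16))
cent, lists = build_ivf(X, nlist=50)
q = rng.standard_normal(16)
truth = brute_force(q, X, topk=10)
for nprobe in (1, 5, 50):
    found = ivf_search(q, X, cent, lists, nprobe=nprobe, topk=10)
    print(nprobe, recall_at_k(found, truth))
```

Raising `nprobe` trades speed for recall: probing every list reproduces the brute-force result, while probing one list scans only a small fraction of the data.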

Article: Vector Search at Scale: The Production Engineer's Guide (mlwhiz.com)
Zain Hasan

I build and teach AI | AI/ML @ Together AI | EngSci ℕΨ/PhD @ UofT | Previously: Vector DBs, Data Scientist, Lecturer & Health Tech Founder | 🇺🇸🇨🇦🇵🇰

20,106 followers · 1 year ago

The researchers at Google DeepMind just introduced "Matryoshka Quantization" (MatQuant), a clever new technique that could make deploying large language models much more efficient.

The key insight? Rather than creating separate models for different quantization levels (int8, int4, int2), MatQuant leverages the nested "Matryoshka" structure naturally present in integer data types. Think of it like Russian nesting dolls - the int2 representation is nested within int4, which is nested within int8.

Here are the major innovations:

1. Single Model, Multiple Precisions
- MatQuant trains one model that can operate at multiple precision levels (int8, int4, int2)
- You can extract lower precision models by simply slicing the most significant bits
- No need to maintain separate models for different deployment scenarios

2. Improved Low-Precision Performance
- Int2 models extracted from MatQuant are up to 10% more accurate than standard int2 quantization
- This is a huge breakthrough since int2 quantization typically severely degrades model quality
- The researchers achieved this through co-training and co-distillation across precision levels

3. Flexible Deployment
- MatQuant enables "Mix'n'Match" - using different precisions for different layers
- You can interpolate to intermediate bit-widths like int3 and int6
- This allows fine-grained control over the accuracy vs. efficiency trade-off

The results are impressive. When applied to the FFN parameters of Gemma-2 9B:
- Int8 and int4 models perform on par with individually trained baselines
- Int2 models show significant improvements (8%+ better on downstream tasks)
- Remarkably, an int2 FFN-quantized Gemma-2 9B outperforms an int8 FFN-quantized Gemma-2 2B

This work represents a major step forward in model quantization, making it easier to deploy LLMs across different hardware constraints while maintaining high performance. The ability to extract multiple precision levels from a single trained model is particularly valuable for real-world applications.

Looking forward to seeing how this technique gets adopted by the community and what further improvements it enables in model deployment efficiency! Let me know if you'd like me to elaborate on any aspect of the paper. I'm particularly fascinated by how they managed to improve int2 performance through the co-training approach.

https://lnkd.in/g6mdmVjx
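
The "nesting dolls" structure of integer codes can be shown directly: the int4 code is the top four bits of the int8 code, and the int2 code is the top two bits of the int4 code. The sketch below covers only that slicing step, under assumptions: the names are hypothetical, slicing is a plain right shift (the paper's exact slicing and rounding may differ), and none of the co-training that makes the low-bit models accurate is included.

```python
import numpy as np


def quantize_uint8(w: np.ndarray):
    """Min/max quantization of weights to unsigned 8-bit codes."""
    lo, hi = float(w.min()), float(w.max())
    scale = (hi - lo) / 255.0 or 1.0
    q = np.clip(np.round((w - lo) / scale), 0, 255).astype(np.uint8)
    return q, lo, scale


def slice_msb(q8: np.ndarray, bits: int) -> np.ndarray:
    """Keep only the `bits` most significant bits of each 8-bit code."""
    return q8 >> (8 - bits)


def dequantize_sliced(q_sliced: np.ndarray, bits: int, lo: float, scale: float) -> np.ndarray:
    """Map a sliced code back to the 8-bit grid (mid-point of the dropped range)."""
    shift = 8 - bits
    q8_approx = q_sliced.astype(np.float64) * (1 << shift) + ((1 << shift) - 1) / 2.0
    return q8_approx * scale + lo


rng = np.random.default_rng(0)
w = rng.standard_normal(4096)
q8, lo, scale = quantize_uint8(w)

for bits in (8, 4, 2):
    w_hat = dequantize_sliced(slice_msb(q8, bits), bits, lo, scale)
    print(bits, "bits, mean abs error:", np.abs(w_hat - w).mean())

# Nesting: slicing int8 straight to int2 equals slicing the int4 code again.
print(np.array_equal(slice_msb(q8, 2), slice_msb(q8, 4) >> 2))
```

Without any training, the error grows quickly as bits are dropped, which is the gap the post says co-training and co-distillation are meant to close.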

Kuldeep Singh Sidhu

Senior Data Scientist @ Walmart | BITS Pilani

16,689 followers · 1 year ago

Groundbreaking Research Alert: 4-bit Quantization for RAG Systems

A fascinating new paper from San José State University introduces an innovative approach to optimize Retrieval-augmented Generation (RAG) systems through 4-bit quantization of vector embeddings.

Technical Deep Dive:
The research tackles a critical challenge in RAG systems - the massive memory requirements for storing high-dimensional embedding vectors. Current top-ranked models on MTEB typically use embedding dimensions between 512 and 4096, consuming substantial memory resources.

Consider this: A standard dbpedia dataset with 1M entries and 1536 dimensions requires 6.1GB of RAM just for embeddings.

The proposed solution? A sophisticated 4-bit quantization approach that:
- Reduces memory footprint by up to 87.5%
- Maintains search accuracy within 4% of original performance
- Implements group-wise quantization for enhanced precision
- Outperforms HNSW algorithm in accuracy with group sizes ≤ 128

Under the Hood:
The system employs symmetric linear quantization with group-wise processing, where vectors are split into equal-sized groups with individual quantization scales. This approach significantly outperforms traditional Product Quantization methods, maintaining correlation coefficients above 0.82 across multiple semantic textual similarity datasets.

Impact:
This breakthrough enables RAG deployment in resource-constrained environments while maintaining high accuracy. The research demonstrates that intelligent quantization can dramatically reduce infrastructure costs without compromising performance.
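
Group-wise symmetric quantization as described (equal-sized groups, one scale per group) can be sketched as below. This is an illustration under assumptions rather than the paper's code: the function names are hypothetical, the integer range is taken as [-7, 7], and scales are assumed to be stored as float32. The arithmetic at the end reproduces the post's figures: 1M × 1536 float32 values are about 6.1 GB, and 4 bits instead of 32 is an 87.5% reduction before the per-group scales are counted.

```python
import numpy as np


def quantize_groupwise_int4(x: np.ndarray, group_size: int):
    """Symmetric linear quantization with one scale per group of values."""
    g = x.reshape(-1, group_size)
    max_abs = np.abs(g).max(axis=1, keepdims=True)
    scale = np.where(max_abs > 0, max_abs / 7.0, 1.0)
    q = np.clip(np.round(g / scale), -7, 7).astype(np.int8)
    return q, scale


def dequantize_groupwise(q: np.ndarray, scale: np.ndarray, shape) -> np.ndarray:
    return (q * scale).reshape(shape)


def pack_int4(q: np.ndarray) -> np.ndarray:
    """Pack two 4-bit codes per byte (element count must be even)."""
    u = (q.reshape(-1) + 8).astype(np.uint8)  # shift [-7, 7] to [1, 15]
    return (u[0::2] << 4) | u[1::2]


def unpack_int4(packed: np.ndarray) -> np.ndarray:
    hi = (packed >> 4).astype(np.int8) - 8
    lo = (packed & 0x0F).astype(np.int8) - 8
    return np.stack([hi, lo], axis=1).reshape(-1)


rng = np.random.default_rng(0)
emb = rng.standard_normal(1536)
q, scale = quantize_groupwise_int4(emb, group_size=128)
packed = pack_int4(q)
emb_hat = dequantize_groupwise(unpack_int4(packed).reshape(q.shape), scale, emb.shape)
print("bytes per vector:", emb.astype(np.float32).nbytes, "->", packed.nbytes)
print("max abs error:", np.abs(emb - emb_hat).max())

# Storage arithmetic for 1M vectors of 1536 dimensions.
fp32_bytes = 1_000_000 * 1536 * 4                   # 6,144,000,000 bytes
int4_bytes = 1_000_000 * 1536 // 2                  # codes only
scale_bytes = 1_000_000 * (1536 // 128) * 4         # assumed float32 scales
print(1 - int4_bytes / fp32_bytes)                  # codes only
print(1 - (int4_bytes + scale_bytes) / fp32_bytes)  # codes plus scales
```

Smaller groups track local value ranges more closely but add more scales per vector, so the "up to 87.5%" figure is the limit when scale overhead is ignored.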

Kasper Groes Albin Ludvigsen

Leader | Board Chair @ DDSC | Advisor @ Mimiry | Open Source AI | Digital sovereignty

5,217 followers · 2 months ago

Google just published something that's a gift to self-hosted LLM inference 🚀

TurboQuant is a new KV cache quantization method from Google Research. It compresses KV cache to 3.5 bits with no accuracy loss, no calibration data.

The benchmark results are hard to argue with. Tested on several benchmarks using open-weights models, 3.5-bit TurboQuant is statistically indistinguishable from BF16 baseline. On H100s it also achieves up to 8x speedup computing attention logits at 4-bit, compared to 32-bit unquantized keys.

For those self-hosting models with vLLM or llama.cpp, the practical implications are significant. A 128K-context 32B session drops from roughly 30 GB of KV cache to around 6 GB. That compression ratio doesn't just save memory; it changes what's feasible on a given server.

A few scenarios where this becomes especially useful:
👉 Agentic coding pipelines running on-prem, where long context windows are the norm and multiple agents may be running concurrently. The freed memory can support far more parallel sessions on the same hardware.
👉 RAG-heavy enterprise deployments where large documents are included in context. The bottleneck shifts from "can we fit this in memory" to "what do we actually want to retrieve."
👉 Long-context document analysis and summarization workflows, where 32B or 70B models at 64K+ context have historically been memory-prohibitive on single-node setups.

On the implementation side, it's moving quickly. Implementation is underway both in vLLM and llama.cpp. The gap between "interesting research" and "running in production" looks unusually short here. Perhaps because of agentic coding tools?
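
The KV cache figures can be sanity-checked with the usual size formula: two tensors (keys and values) per layer, each holding one value per KV head, head dimension, and token. The model configuration below is a hypothetical 32B-class setup with grouped-query attention, chosen only for illustration; it is not taken from the post.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   context_len: int, bits_per_value: float) -> float:
    """Size of the KV cache for one sequence."""
    values = 2 * layers * kv_heads * head_dim * context_len  # keys + values
    return values * bits_per_value / 8


# Assumed configuration (hypothetical): 64 layers, 8 KV heads, head_dim 128.
cfg = dict(layers=64, kv_heads=8, head_dim=128, context_len=128 * 1024)

bf16 = kv_cache_bytes(**cfg, bits_per_value=16)
q35 = kv_cache_bytes(**cfg, bits_per_value=3.5)

print(f"BF16:    {bf16 / 1e9:.1f} GB")
print(f"3.5-bit: {q35 / 1e9:.1f} GB")
print(f"ratio:   {bf16 / q35:.2f}x")
```

Under these assumptions the cache lands in the same range as the post's "roughly 30 GB" and shrinks by 16 / 3.5, a little under 5x. The exact numbers depend on the model's layer count and attention layout.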

Sohrab Rahimi

Director, AI/ML Lead @ Google

23,988 followers · 2 months ago

Google solved one of the most important constraints of language models by reducing how much memory they need to run, which directly translates into freed compute capacity and lower serving cost.

The paper “TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate” from Google Research focuses on a simple observation. LLMs do not rely on exact vector values during inference. They rely on inner products between vectors. If those similarities are preserved, the model behaves the same. The design follows from that.

First, vectors are transformed to make compression tractable. A random rotation makes each dimension behave almost independently. That removes the need for complex, data-specific quantization schemes and allows each coordinate to be compressed separately with minimal loss.

Second, the method fixes what compression breaks. Standard quantization distorts inner products, which directly affects attention. TurboQuant isolates that distortion and encodes it using a 1-bit correction signal based on a Quantized Johnson–Lindenstrauss transform. This restores unbiased similarity calculations with negligible overhead.

The key architectural move is separation. One stage compresses efficiently. The other guarantees correctness of interactions between vectors. That is why the system can push compression without degrading performance.

The empirical result is what matters. KV cache can be compressed by more than 5x while maintaining the same downstream accuracy on long-context tasks. This brings compression close to the theoretical limit for this problem.

Now the practical impact. Take a RAG system with 10M documents embedded at 1536 dimensions. Stored in float16, that is roughly 30 GB of vector data. With 5x to 8x compression, that drops to around 4–6 GB. The entire index fits in GPU memory. Retrieval becomes local, faster, and cheaper.

On the generation side, the KV cache shrinks by the same factor. A system capped at 32k context due to memory can push toward 150k on the same hardware, or run several times more concurrent requests per GPU.

In practice, a deployment that required 8 GPUs to serve long-context RAG queries can drop to 2–3 GPUs for the same throughput, or keep the hardware and scale traffic significantly.

No retraining. No change to the model. Just a different way of encoding state during inference.

This is a significant and impactful breakthrough. Memory is the bottleneck in modern LLM systems. If you compress it without breaking similarity, you unlock longer context, higher throughput, and materially lower cost at the same time.

Blog: https://lnkd.in/ei3Nb5Vv
Paper: https://lnkd.in/eyJ4Hf9U
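
The storage figures in the RAG example follow from simple arithmetic, worked through below. The helper name is hypothetical, and the context-length line assumes memory scales linearly with context, which ignores weights and activations.

```python
def index_bytes(num_vectors: int, dim: int, bits_per_value: float) -> float:
    """Raw storage for a flat vector index, ignoring any metadata."""
    return num_vectors * dim * bits_per_value / 8


fp16 = index_bytes(10_000_000, 1536, 16)
print(f"float16 index: {fp16 / 1e9:.2f} GB")

for ratio in (5, 8):
    print(f"{ratio}x compression: {fp16 / ratio / 1e9:.2f} GB")

# If only the KV cache limited context, 5x compression would turn a
# 32k-token budget into roughly this many tokens on the same memory.
print(32_000 * 5)
```

The float16 index comes to about 30.7 GB, and the 5x and 8x cases bracket the post's "around 4–6 GB".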

Kartik Mathur

AI @ Vectara, Ex-Microsoft

3,600 followers · 2 months ago

This week, Google Research achieved nearly lossless compression of the KV Cache in frontier models. The paper is called TurboQuant, and it's being presented at ICLR 2026. Here's what it does and why it matters.

Large language models keep a "cheat sheet" while they're thinking. It's called the KV Cache, short for Key-Value Cache. Instead of re-reading everything from scratch, the model stores recently used information there so it can retrieve it fast.

The catch: this cheat sheet is enormous. It eats up memory, slows things down, and becomes a real bottleneck at scale.

The obvious fix is compression. But traditional compression methods introduce their own overhead, adding 1 to 2 extra bits per number and partially canceling out the benefit.

TurboQuant compresses the KV Cache down to just 3 bits per number, with no retraining required and no measurable loss in model accuracy.

Two algorithms power this approach:
 • PolarQuant handles the heavy lifting. Instead of storing a vector using standard X/Y/Z coordinates, it converts it into polar coordinates (think: an angle plus a distance).
 • QJL (Quantized Johnson-Lindenstrauss) mops up the residual error using just 1 bit. It acts as a mathematical error-checker that removes bias from the compressed approximation, keeping attention scores accurate.

The KV Cache bottleneck is one of the main reasons long-context inference is expensive. But vector quantization is also the foundation of large-scale semantic search, the technology that lets search engines find meaning rather than just keywords.

This is the kind of foundational infrastructure work that quietly makes everything else faster and cheaper.

Blog: https://lnkd.in/g5NcAFQ5

#KVCache #Quantization #LLMInference #AIResearch #GoogleResearch #MachineLearning #ICLR2026
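
The 1-bit idea behind QJL can be illustrated with a sign-of-random-projection estimator. For a Gaussian projection row s, the expectation of (s·q)·sign(s·k) equals sqrt(2/π)·(q·k)/‖k‖, so storing only the signs and the norm of k still yields an unbiased estimate of the inner product. The sketch below applies this to a whole key vector for simplicity; per the posts, TurboQuant uses it on the residual left by the first-stage quantizer, and that pipeline is not reproduced here. All names are hypothetical.

```python
import numpy as np


def qjl_encode(k: np.ndarray, S: np.ndarray):
    """Keep one sign bit per projection, plus the norm of the key."""
    return np.sign(S @ k).astype(np.int8), float(np.linalg.norm(k))


def qjl_inner_product(q: np.ndarray, signs: np.ndarray, k_norm: float, S: np.ndarray) -> float:
    """Unbiased estimate of q . k from the sign bits (query stays full precision)."""
    m = S.shape[0]
    return float(np.sqrt(np.pi / 2) / m * k_norm * ((S @ q) @ signs))


rng = np.random.default_rng(0)
dim, m = 64, 20_000                      # m projections; large here to reduce variance
S = rng.standard_normal((m, dim))        # shared Gaussian projection matrix

k = rng.standard_normal(dim)
k /= np.linalg.norm(k)
q = k + 0.5 * rng.standard_normal(dim)   # query correlated with the key
q /= np.linalg.norm(q)

signs, k_norm = qjl_encode(k, S)
print("exact:   ", float(q @ k))
print("estimate:", qjl_inner_product(q, signs, k_norm, S))
```

The estimate is noisy for small `m`. Its value is that the noise has zero mean, so it can correct systematic bias left by a coarser quantizer instead of adding to it.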

Taras Tsugrii

GenAI builder

5,517 followers · 2 months ago

Most people hear “3-bit quantization” and imagine each value now costs 3 bits. In practice, that is often false.

The real cost is:
effective bits/value = payload bits + metadata overhead

At very low bitwidths, that hidden overhead can dominate.

That is why the most interesting part of Google’s TurboQuant (https://lnkd.in/g5FNc7Gv) writeup is not just “more compression for LLMs.” It is the deeper systems idea: below a certain bitwidth, side information stops being a detail and starts becoming the format.

Traditional low-bit vector quantization usually carries extra baggage: scales, norms, block constants, codebooks, normalization metadata. Those bits are amortized, but they are still real. If the payload is 3 bits/value and the side information costs another 1-2 bits/value, then the representation is not really 3-bit in the only sense hardware cares about: bytes stored and bytes moved.

The geometric part is what makes this elegant. For attention and vector search, the goal is not perfect reconstruction of each coordinate. It is preserving the downstream computation, especially inner products. That is a subtle but important shift.

If you randomly rotate a vector before quantizing it, you often spread its energy more evenly across coordinates. In geometric terms, the data becomes less spiky in the chosen basis and more isotropic. Quantization gets easier because no single coordinate carries too much structure.

Intuitively:
- a bad coordinate system makes the cloud look stretched and fragile
- a better one makes it look rounder and easier to compress

That is why the PolarQuant angle is interesting. Cartesian quantization asks: which box did this point land in? A polar-style view asks: how far from the origin is it, and in what direction is it pointing? For similarity-heavy workloads, that second question is often closer to the computation that actually matters.

Then comes the part I especially like: don’t spend every extra bit on making the first approximation uniformly better. Spend most bits on a strong coarse representation, then use a tiny residual budget to remove systematic error.

This generalizes far beyond AI. In any compressed system, the representation is never just the data. It is the data plus everything required to interpret it. Once the payload gets small enough, the interpretation overhead becomes the main event.

That is the real lesson I took from TurboQuant: extreme compression only becomes real when you compress the hidden structure around the values, not just the values themselves.
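
The "effective bits/value" formula is easy to make concrete for block-wise formats, where each block of values shares some metadata. The block sizes and metadata layouts below are illustrative assumptions, not figures from the TurboQuant writeup.

```python
def effective_bits_per_value(payload_bits: float, block_size: int,
                             metadata_bits_per_block: float) -> float:
    """Payload bits plus the per-block metadata amortized over the block."""
    return payload_bits + metadata_bits_per_block / block_size


# 3-bit payload, blocks of 32 values, fp16 scale + fp16 zero-point (32 bits).
print(effective_bits_per_value(3, 32, 32))    # 4.0
# Same metadata on smaller blocks of 16 values.
print(effective_bits_per_value(3, 16, 32))    # 5.0
# One fp16 scale shared by 128 values.
print(effective_bits_per_value(3, 128, 16))   # 3.125
```

The first two cases show how a nominal 3-bit format can cost 1 to 2 extra bits per value, the overhead range the post describes. The third shows that larger blocks shrink the overhead, usually at the price of coarser scales.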


Uday Kamath, Ph.D.

Building Industry-First AI in Regulated Industries | 8x Author AI Books(LLMs, RL, XAI) | Keynote Speaker |

8,363 followers, 2 months ago

Google Research's latest paper at ICLR 2026 (TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate) tackles one of AI's most expensive infrastructure problems. At 128K context, the KV cache alone costs 33 GB of GPU memory per user. TurboQuant compresses it to 3 bits: 6× less memory, 8× faster attention on H100, zero accuracy loss, no retraining. The trick: one random rotation makes every key vector follow the same predictable distribution. No per-vector statistics. No overhead. Just math. I break down the paper, the math, and my own implementation on Google Colab. https://lnkd.in/etuwmnAE #LLM #MachineLearning #AIInfrastructure

TurboQuant: Google Just Solved the KV Cache Bottleneck (substack.com)
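The rotation step mentioned in the post can be illustrated with a small plain-Python sketch. It assumes a randomized Walsh-Hadamard transform (random sign flips followed by an orthonormal Hadamard transform) as the rotation, which may differ from the rotation the paper uses, and the helper names are hypothetical. It shows two properties that matter here: a spiky vector has its energy spread evenly across coordinates, and inner products are unchanged.

```python
import math
import random


def fwht(values):
    """Orthonormal Walsh-Hadamard transform of a power-of-two-length list."""
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(values)
    h = 1
    while h < n:
        # Butterfly pass: combine elements that are h apart.
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in out]


def make_signs(n, seed=0):
    """Random +1/-1 flips, shared by every vector that uses this rotation."""
    rng = random.Random(seed)
    return [rng.choice((-1.0, 1.0)) for _ in range(n)]


def rotate(vector, signs):
    """Apply the sign flips, then the orthonormal Hadamard transform."""
    return fwht([s * v for s, v in zip(signs, vector)])


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


signs = make_signs(8, seed=0)
spiky = [4.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]  # all energy in one coordinate
other = [1.0, -2.0, 0.5, 3.0, -1.5, 0.25, 2.0, -0.75]

rotated_spiky = rotate(spiky, signs)
rotated_other = rotate(other, signs)

print("largest coordinate before:", max(abs(v) for v in spiky))
print("largest coordinate after:", max(abs(v) for v in rotated_spiky))
print("inner product before:", dot(spiky, other))
print("inner product after:", dot(rotated_spiky, rotated_other))
```

Because the transform is orthogonal, norms and inner products survive the rotation, while every rotated coordinate of the one-hot input ends up with the same magnitude. That even spread is what lets a single fixed quantization grid work across vectors without per-vector statistics.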