DeepSpeed-FastGen Performance and Feature Enhancements

DeepSpeed-FastGen is a serving framework that enables easy, fast, and affordable inference for large language models (LLMs). From general chat models to document summarization, and from autonomous driving to copilots at every layer of the software stack, the demand to deploy and serve these models at scale has skyrocketed. DeepSpeed-FastGen uses the Dynamic SplitFuse technique to tackle the unique challenges of serving these applications and offers higher effective throughput than other state-of-the-art systems.
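
At a high level, Dynamic SplitFuse composes each forward pass from a fixed token budget: long prompts are split into chunks across several passes, and those chunks are fused into the same batch as ongoing decode tokens so every pass does a similar amount of work. The snippet below is a simplified scheduling sketch of that idea only; the `Request` class, `TOKEN_BUDGET` value, and policy details are illustrative assumptions, not DeepSpeed internals.

```python
from dataclasses import dataclass

TOKEN_BUDGET = 512  # illustrative per-pass token budget, not DeepSpeed's actual default

@dataclass
class Request:
    prompt_tokens: int       # prompt tokens not yet prefilled into the KV cache
    decoding: bool = False   # True once the prompt has been fully processed

def schedule_forward_pass(requests: list[Request]) -> list[tuple[Request, int]]:
    """Pick (request, num_tokens) pairs whose token counts sum to <= TOKEN_BUDGET.

    Decoding requests contribute one token each; prompts are chunked so a long
    prompt never monopolizes a pass (the "split"), and prompt chunks run in the
    same batch as decode tokens (the "fuse").
    """
    batch, budget = [], TOKEN_BUDGET
    # 1) Serve ongoing decodes first: one token per sequence.
    for r in requests:
        if r.decoding and budget > 0:
            batch.append((r, 1))
            budget -= 1
    # 2) Fill the remaining budget with prompt chunks.
    for r in requests:
        if not r.decoding and r.prompt_tokens > 0 and budget > 0:
            chunk = min(r.prompt_tokens, budget)
            batch.append((r, chunk))
            budget -= chunk
    return batch
```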

Fast and Expressive LLM Inference with RadixAttention and SGLang

Large Language Models (LLMs) are increasingly utilized for complex tasks that require multiple chained generation calls, advanced prompting techniques, control flow, and interaction with external environments. However, there is a notable deficiency in efficient systems for programming and executing these applications. To address this gap, we introduce SGLang, a Structured Generation Language for LLMs.
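
RadixAttention, the runtime technique behind SGLang, reuses KV cache across requests that share a common prompt prefix by organizing cached sequences in a radix tree. The sketch below illustrates just the prefix-matching idea with a plain trie over token IDs; the `PrefixCache` class and its methods are invented for illustration and are not SGLang's API.

```python
class PrefixCacheNode:
    """One node per token ID; a real implementation stores KV tensors along the edges."""
    def __init__(self):
        self.children: dict[int, "PrefixCacheNode"] = {}
        self.has_kv = False  # whether the KV cache for this prefix is resident

class PrefixCache:
    def __init__(self):
        self.root = PrefixCacheNode()

    def insert(self, token_ids: list[int]) -> None:
        node = self.root
        for t in token_ids:
            node = node.children.setdefault(t, PrefixCacheNode())
            node.has_kv = True

    def longest_cached_prefix(self, token_ids: list[int]) -> int:
        """Return how many leading tokens already have cached KV entries,
        so the model only needs to prefill the remaining suffix."""
        node, matched = self.root, 0
        for t in token_ids:
            nxt = node.children.get(t)
            if nxt is None or not nxt.has_kv:
                break
            node, matched = nxt, matched + 1
        return matched

# Two chat turns sharing the same system prompt reuse its KV cache:
cache = PrefixCache()
cache.insert([1, 2, 3, 10, 11])                        # system prompt + first question
hit = cache.longest_cached_prefix([1, 2, 3, 20, 21])   # same system prompt, new question
assert hit == 3                                        # only the new suffix needs prefill
```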

OpenCompass LLM Leaderboard

OpenCompass is an advanced benchmark suite featuring three key components: CompassKit, CompassHub, and CompassRank. CompassRank has been significantly enhanced to incorporate both open-source and proprietary benchmarks. CompassHub provides a benchmark browsing interface designed to make it easy for researchers and practitioners to explore and use a wide array of benchmarks.

DeepSeek MoE

DeepSeekMoE 16B is a Mixture-of-Experts (MoE) language model with 16.4B parameters. It employs an innovative MoE architecture built on two principal strategies: fine-grained expert segmentation and shared expert isolation. Trained from scratch on 2T English and Chinese tokens, it achieves performance comparable to DeepSeek 7B and LLaMA2 7B while using only about 40% of the computation.
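
Concretely, each MoE layer in this design pairs a few always-active shared experts with many small routed experts, of which only the top-k fire per token. Below is a minimal PyTorch-style sketch of that composition; the layer sizes, expert counts, and class names are placeholders, not DeepSeekMoE's actual configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Expert(nn.Module):
    """A small feed-forward block; fine-grained routed experts use a reduced hidden size."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))

    def forward(self, x):
        return self.net(x)

class SharedPlusRoutedMoE(nn.Module):
    """Shared experts always fire; many small routed experts fire only for their top-k tokens."""
    def __init__(self, d_model=512, n_shared=2, n_routed=64, top_k=6):
        super().__init__()
        self.shared = nn.ModuleList([Expert(d_model, 2 * d_model) for _ in range(n_shared)])
        self.routed = nn.ModuleList([Expert(d_model, d_model // 2) for _ in range(n_routed)])
        self.router = nn.Linear(d_model, n_routed)
        self.top_k = top_k

    def forward(self, x):                                   # x: (tokens, d_model)
        shared_out = sum(e(x) for e in self.shared)         # shared experts: no routing
        scores = F.softmax(self.router(x), dim=-1)          # (tokens, n_routed)
        weights, idx = scores.topk(self.top_k, dim=-1)      # (tokens, top_k)
        routed_out = torch.zeros_like(x)
        for k in range(self.top_k):                         # dispatch tokens to their experts
            for e_id in idx[:, k].unique():
                mask = idx[:, k] == e_id
                routed_out[mask] += weights[mask, k, None] * self.routed[int(e_id)](x[mask])
        return shared_out + routed_out

tokens = torch.randn(8, 512)
print(SharedPlusRoutedMoE()(tokens).shape)                  # torch.Size([8, 512])
```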

Fine-tune Mistral with Direct Preference Optimization

Pre-trained large language models (LLMs) only perform next-token prediction, so base models cannot reliably follow instructions or answer questions. This is why base models are fine-tuned on pairs of instructions and answers to act as helpful assistants. However, this process can still be flawed: fine-tuned LLMs can be biased, toxic, harmful, etc.
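
Direct Preference Optimization (DPO) tackles this by fine-tuning directly on preference pairs (a chosen and a rejected answer for the same prompt), without training a separate reward model. The sketch below shows the core DPO loss given per-sequence log-probabilities from the policy and a frozen reference model; the variable names and beta value are illustrative.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss over a batch of preference pairs.

    Each argument is a tensor of per-sequence log-probabilities (summed over the
    tokens of the answer). The loss pushes the policy to prefer the chosen answer
    over the rejected one, measured relative to the frozen reference model.
    """
    chosen_rewards = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_rewards = beta * (policy_rejected_logp - ref_rejected_logp)
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy batch of 4 preference pairs:
loss = dpo_loss(torch.tensor([-5.0, -6.0, -4.5, -7.0]),   # policy log p(chosen)
                torch.tensor([-6.5, -6.2, -5.0, -8.0]),   # policy log p(rejected)
                torch.tensor([-5.5, -6.1, -4.8, -7.2]),   # reference log p(chosen)
                torch.tensor([-6.0, -6.0, -5.1, -7.9]))   # reference log p(rejected)
print(loss)
```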

Reasoning with Foundation Models

We organize current foundation models into three categories: language foundation models, vision foundation models, and multimodal foundation models. We then elaborate on how foundation models are applied to reasoning tasks, including commonsense, mathematical, logical, causal, visual, audio, multimodal, and agent reasoning. Reasoning techniques, including pre-training, fine-tuning, alignment training, mixture of experts, in-context learning, and autonomous agents, are also summarized.
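
As a concrete example of one of these techniques, in-context learning for reasoning is often applied as few-shot chain-of-thought prompting. The sketch below builds such a prompt; the demonstrations and question are made up for illustration, and any instruction-following LLM could consume the resulting string.

```python
# Hand-written worked examples (chain-of-thought demonstrations).
EXAMPLES = [
    ("If there are 3 boxes with 4 apples each, how many apples are there?",
     "Each box has 4 apples and there are 3 boxes, so 3 * 4 = 12. The answer is 12."),
    ("A train travels 60 km in 1.5 hours. What is its average speed?",
     "Speed is distance over time: 60 / 1.5 = 40. The answer is 40 km/h."),
]

def build_cot_prompt(question: str) -> str:
    """Prepend worked examples so the model imitates step-by-step reasoning."""
    parts = [f"Q: {q}\nA: {reasoning}" for q, reasoning in EXAMPLES]
    parts.append(f"Q: {question}\nA: Let's think step by step.")
    return "\n\n".join(parts)

print(build_cot_prompt("A book costs 12 dollars and a pen costs 3 dollars. "
                       "How much do 2 books and 4 pens cost?"))
```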

RAG Survey

Large language models (LLMs) have become an integral part of our lives and work, transforming how we interact with information through their astonishing versatility and intelligence. Despite their impressive capabilities, they are not without flaws. These models can produce misleading “hallucinations,” rely on potentially outdated information, be inefficient when dealing with specific knowledge, lack depth in specialized fields, and fall short in reasoning abilities.
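
Retrieval-Augmented Generation (RAG), the subject of the survey, mitigates these weaknesses by retrieving relevant documents at query time and conditioning the model on them. Below is a minimal sketch of the retrieve-then-generate pattern; the toy word-overlap scorer, corpus, and function names are illustrative stand-ins for a real embedding model and vector store.

```python
def relevance(query: str, doc: str) -> float:
    """Toy relevance score via word overlap; a real system would use dense embeddings."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / max(len(q), 1)

def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Return the k documents most relevant to the query."""
    return sorted(corpus, key=lambda doc: relevance(query, doc), reverse=True)[:k]

def build_rag_prompt(query: str, corpus: list[str]) -> str:
    """Augment the prompt with retrieved context so the model answers from fresh sources."""
    context = "\n".join(f"- {doc}" for doc in retrieve(query, corpus))
    return (
        "Answer the question using only the context below.\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )

corpus = [
    "Policy A was last updated in March 2024.",
    "Policy B covers remote work arrangements.",
    "Office hours are 9:00 to 17:00 on weekdays.",
]
print(build_rag_prompt("When was Policy A last updated?", corpus))
```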

Mixture of Experts Explained

With the release of Mixtral 8x7B (announcement, model card), a class of transformer models has become the hottest topic in the open-source AI community: Mixture of Experts, or MoEs for short. In this blog post, we take a look at the building blocks of MoEs, how they're trained, and the tradeoffs to consider when serving them for inference.
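
The building block that makes MoEs sparse is the router (or gate): a small linear layer scores every expert per token, only the top-k experts (top-2 in Mixtral) are actually run, and their outputs are mixed with the renormalized gate weights. A minimal sketch of that gating step follows; the dimensions and names are illustrative.

```python
import torch
import torch.nn.functional as F

def top_k_gate(hidden: torch.Tensor, gate_weight: torch.Tensor, k: int = 2):
    """Score experts per token and keep only the top-k.

    hidden:      (tokens, d_model) token representations
    gate_weight: (n_experts, d_model) parameters of the router's linear layer
    Returns expert indices (tokens, k) and renormalized mixing weights (tokens, k).
    """
    logits = hidden @ gate_weight.T                  # (tokens, n_experts)
    top_logits, top_idx = logits.topk(k, dim=-1)     # pick k experts per token
    mix = F.softmax(top_logits, dim=-1)              # renormalize over the selected experts
    return top_idx, mix

hidden = torch.randn(4, 32)                          # 4 tokens, d_model = 32
gate_weight = torch.randn(8, 32)                     # 8 experts
idx, mix = top_k_gate(hidden, gate_weight)
print(idx.shape, mix.sum(dim=-1))                    # each token's weights sum to 1
```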