Orca-AgentInstruct, from Microsoft Research, can generate diverse, high-quality synthetic data at scale to post-train and fine-tune base LLMs for expanded capabilities, continual learning, and increased performance.

The post Orca-AgentInstruct: Agentic flows can be effective synthetic-data generators appeared first on Microsoft Research.

This discrepancy may be attributed to the challenge of generating high-quality and diverse synthetic data. Successful use of synthetic data involves significant human effort in curating and filtering the data to ensure high quality.

Synthetic data meets agents: Another major development of the past year is the rise of agentic (especially multi-agent) workflows, such as those built with AutoGen. Agentic workflows can generate high-quality data that surpasses the capabilities of the underlying LLMs by using flows with reflection and iteration, in which agents look back at solutions, generate critiques, and improve the solutions. They can also use tools like search APIs, calculators, and code interpreters to address LLM limitations. Multi-agent workflows bring additional benefits as well, such as simulating scenarios in which both new prompts and the corresponding responses are generated. They also enable automation of data-generation workflows, reducing or eliminating the need for human intervention on some tasks.

AgentInstruct: Generating synthetic data for post-training or fine-tuning often relies on an existing prompt set that is used either as is or as seeds for generating more instructions. In this work, we generalize the problem setting to the broader objective of generating an abundant amount of diverse, challenging, and high-quality data to teach a particular skill to an AI model. We refer to this setting as generative teaching. AgentInstruct is an agentic solution for generative teaching. It uses raw documents as input to create demonstration and feedback data.
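The reflection-and-iteration flow described above, in which agents look back at a solution, critique it, and improve it, can be illustrated with a minimal Python sketch. This is not the AgentInstruct implementation: the function `refine_with_reflection` and the three toy agents are hypothetical names introduced here, and the toy agents are plain functions standing in for LLM-backed agents so the example runs without any model.

```python
from typing import Callable


def refine_with_reflection(
    prompt: str,
    generate: Callable[[str], str],
    critique: Callable[[str, str], str],
    revise: Callable[[str, str, str], str],
    max_rounds: int = 3,
) -> str:
    """Draft an answer, then alternate critique and revision.

    The loop stops when the critic returns empty feedback or when
    max_rounds revisions have been made.
    """
    answer = generate(prompt)
    for _ in range(max_rounds):
        feedback = critique(prompt, answer)
        if not feedback:  # empty feedback means the critic accepts the answer
            break
        answer = revise(prompt, answer, feedback)
    return answer


# Toy stand-ins (assumptions for illustration); real agents would call an LLM.
def toy_generate(prompt: str) -> str:
    return "4"


def toy_critique(prompt: str, answer: str) -> str:
    return "" if "because" in answer else "Explain the reasoning."


def toy_revise(prompt: str, answer: str, feedback: str) -> str:
    return f"{answer}, because 2 + 2 = 4"


result = refine_with_reflection(
    "What is 2 + 2?", toy_generate, toy_critique, toy_revise
)
print(result)
```

Passing the agents in as callables keeps the loop independent of any particular model or framework, so the same control flow could wrap tool-using agents as well.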
When generic data is used as seeds, AgentInstruct can teach an LLM a general capability, such as writing, reasoning, or retrieval-augmented generation (RAG). Domain-specific data, such as retail or finance data, can also be used as seeds to improve the model in a particular specialization. AgentInstruct can create:

High-quality data: AgentInstruct uses GPT-4, coupled with tools like search and code interpreters, to create high-quality data.
Diverse data: AgentInstruct creates prompts and responses using a set of specialized agents (with powerful LLMs, tools, and reflection flows) and a taxonomy (of more than 100 subcategories), ensuring diversity and quality.
Large quantities of data: AgentInstruct can run autonomously and apply flows for verification and data filtering. It does not require seed prompts and uses raw documents for seeding.

Using raw data as seeds offers two advantages: it is plentiful, allowing AgentInstruct to generate large-scale and diverse datasets, and it encourages learning general skills instead of benchmark-specific ones by avoiding the use of existing prompts.

We anticipate agentic flows becoming increasingly important throughout the model-training lifecycle, including pre-training, post-training, and specialization, and ultimately enabling the creation of a synthetic data factory for model customization and continuous improvement. This has the potential to drive AI advances across multiple industries by making high-quality model training more efficient and accessible.
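The seeding idea described above (raw documents in, taxonomy-driven prompts out, followed by a filtering pass) can be sketched in Python as follows. Everything here is an assumption made for illustration: the two-entry `TAXONOMY`, the `Example` type, and the functions `make_prompt`, `passes_filter`, and `generate_dataset` are hypothetical, and the agents are replaced by simple deterministic functions rather than LLM calls.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Example:
    category: str
    source: str
    prompt: str


# Hypothetical mini-taxonomy; the post mentions more than 100 subcategories.
TAXONOMY = ["summarization", "question answering"]


def make_prompt(document: str, category: str) -> str:
    """Stand-in for an instruction-creation agent (a real one would call an LLM)."""
    return f"[{category}] Using the passage below, write a task.\n{document}"


def passes_filter(example: Example, min_source_words: int = 5) -> bool:
    """Stand-in for a verification step: drop examples seeded by very short text."""
    return len(example.source.split()) >= min_source_words


def generate_dataset(documents: list[str]) -> list[Example]:
    # Pair every raw document with every taxonomy entry, then filter.
    candidates = [
        Example(category, doc, make_prompt(doc, category))
        for doc, category in product(documents, TAXONOMY)
    ]
    return [ex for ex in candidates if passes_filter(ex)]


docs = [
    "Retail sales rose four percent in the third quarter.",
    "Too short.",
]
dataset = generate_dataset(docs)
for ex in dataset:
    print(ex.category)
```

Because no seed prompts are involved, the number of candidates grows with the number of raw documents times the number of taxonomy entries, which is one way to read the post's point that raw data makes large, diverse datasets possible.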
Contributors: Arindam Mitra, Luciano Del Corro, Guoqing Zheng, Shweti Mahajan, Dany Rouhana, Andres Codas, Yadong Lu, Wei-ge Chen, Olga Vrousgou, Corby Rosset, Fillipe Silva, Hamed Khanpour, Yash Lara, and Ahmed Awadallah.
Published: 2024-11-14T17:00:00