Advancing Enterprise AI: How Wix is Democratizing RAG Evaluation
- Wix Engineering
- Jun 30
- 4 min read

The Challenge
Retrieval-Augmented Generation (RAG) has emerged as one of the most important design patterns in modern AI systems, particularly for enterprises building customer support chatbots and knowledge base applications.
While RAG enables AI systems to provide accurate, up-to-date information by coupling language models with external knowledge bases, many organizations struggle with a critical challenge: evaluating and improving these complex systems.
Unlike systems that answer simple factual questions, enterprise RAG systems must handle multi-step procedures, domain-specific terminology, and queries requiring information synthesis from multiple sources - all while maintaining transparency and trust.
Traditional evaluation methods fall short in enterprise environments, relying on Wikipedia-style datasets that don't reflect real customer support complexities. Most existing benchmarks can't handle the messy realities of domain-specific queries, and evaluation frameworks typically provide scores without actionable guidance for improvement.
This creates a fundamental problem: how do you debug and enhance complex multi-stage RAG pipelines when you don't understand which components are failing or why?
Open-Sourcing Enterprise AI
Today, we're introducing two contributions: WixQA, an enterprise RAG benchmark grounded in real customer support data, and RAGXplain, an evaluation framework that turns scores into actionable recommendations.
Building reliable AI is too big a challenge for any one company to tackle alone. That's why we're open-sourcing everything: our research papers, complete datasets, evaluation frameworks, and the specific prompts we use, all under the MIT license, so that anyone can use them to build better AI systems. When researchers and developers share their work openly, we all benefit from faster progress and better solutions.
Building AI that actually works for businesses means creating systems that can explain their answers, that users can trust, and that solve real problems - and that is a challenge we need to work on together.
1. WixQA: A Real-World Enterprise RAG Benchmark
WixQA: A Multi-Dataset Benchmark for Enterprise Retrieval-Augmented Generation is a comprehensive benchmark suite designed specifically for enterprise RAG evaluation.
Unlike academic datasets built from Wikipedia or general knowledge sources, WixQA is grounded in real customer support interactions from Wix.
The benchmark consists of three datasets plus a synchronized knowledge base snapshot, all available on HuggingFace:
WixQA-ExpertWritten: 200 genuine customer queries paired with detailed, step-by-step answers manually authored and validated by domain experts. These represent the complex, multi-faceted problems that real customers face, such as configuring domain settings, troubleshooting SSL certificates, or setting up e-commerce functionality.
WixQA-Simulated: 200 expert-validated question-answer pairs distilled from actual multi-turn conversations between users and support chatbots. This dataset captures the essence of real support interactions while providing clean, single-turn QA pairs for systematic evaluation.
WixQA-Synthetic: 6,221 LLM-generated question-answer pairs systematically derived from each article in the Wix knowledge base, providing the scale and comprehensive coverage necessary for training robust models.
What makes WixQA particularly valuable is its multi-article dependency feature - unlike traditional benchmarks where each question maps to a single source document, many WixQA answers require synthesizing information from multiple knowledge base articles.
This reflects the reality of enterprise support, where solving a customer's problem often requires pulling together information from various documentation sources.
The entire benchmark, including the complete knowledge base of 6,221 Wix help articles that answers were derived from, is available under an MIT license, making it freely accessible to researchers and practitioners worldwide.
Download the dataset: WixQA on HuggingFace.
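If you want to explore the data before wiring it into a pipeline, a minimal sketch like the one below works with the Hugging Face datasets library. Note that the repository id, config name, split, and field names are assumptions for illustration - check the dataset card on HuggingFace for the exact values.

```python
# Minimal sketch: browse a WixQA split with the Hugging Face `datasets` library.
# The repository id, config name, split, and field names are illustrative
# placeholders -- confirm the exact values on the WixQA dataset card.
from datasets import load_dataset

expert_written = load_dataset("Wix/WixQA", "wixqa_expert_written", split="test")

for row in expert_written.select(range(3)):
    print("Q:", row["question"])
    print("A:", row["answer"][:120], "...")
```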
2. RAGXplain: From Scores to Actionable Insights

Traditional RAG evaluation tools tell you what went wrong with abstract scores like "Context Adherence: 0.4" but leave you guessing about why it happened and how to fix it.
RAGXplain's breakthrough contribution is transforming these opaque numbers into clear, human-readable explanations and specific action items that teams can immediately implement.
When RAGXplain evaluates your system using six complementary metrics (Context Relevancy, Context Adherence, Answer Relevancy, Context Recall, Factuality, and Grading Note), it doesn't just report scores - it explains the story behind them.

Instead of seeing "Context Adherence: 0.4" you get: "The AI is ignoring retrieved documents and generating answers from internal knowledge. This happened because your generation prompt doesn't emphasize using the provided context. Fix it by adding this exact phrase: 'Base your response strictly on the provided references and do not use external knowledge.'"
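As a concrete illustration, here is how that suggested sentence might be dropped into a generation prompt. The surrounding template is a generic sketch, not a prompt RAGXplain produces; only the quoted grounding instruction comes from the recommendation above.

```python
# Illustration only: a generic RAG generation prompt that incorporates the
# grounding instruction suggested above. The template structure is an
# assumption; only the "Base your response strictly..." sentence is quoted
# from the recommendation.
GENERATION_PROMPT = """You are a customer-support assistant.

References:
{retrieved_context}

Question: {question}

Base your response strictly on the provided references and do not use external knowledge.
Answer:"""

def build_prompt(question: str, retrieved_context: str) -> str:
    """Fill the template with the user's question and the retrieved articles."""
    return GENERATION_PROMPT.format(
        question=question, retrieved_context=retrieved_context
    )
```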
The framework's real power lies in its ability to synthesize patterns across your entire dataset and translate them into prioritized, actionable recommendations. For instance, when Context Recall consistently scores below 0.6 across multiple queries, RAGXplain identifies this as "insufficient document retrieval" and provides specific guidance: "Increase your retrieval parameter k from 5 to 10 to capture more relevant information."
Each recommendation includes the problem explanation, why it occurred, a concrete example from your data, and step-by-step implementation instructions.
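To make the shape of such a recommendation concrete, here is a small sketch of how one might be represented as data. The field names and example values are our own illustration, not RAGXplain's actual output schema.

```python
# Sketch of a structured recommendation record. Field names and example values
# are illustrative, not RAGXplain's actual output schema.
from dataclasses import dataclass

@dataclass
class Recommendation:
    metric: str    # metric that triggered the finding, e.g. "Context Recall"
    score: float   # aggregate score observed across the dataset
    problem: str   # plain-language description of what went wrong
    cause: str     # why it happened
    example: str   # a concrete failing query from the evaluated data
    action: str    # step-by-step fix the team can implement

low_recall = Recommendation(
    metric="Context Recall",
    score=0.55,
    problem="Insufficient document retrieval",
    cause="Relevant knowledge base articles are not reaching the generator",
    example="How do I connect a custom domain and enable SSL?",
    action="Increase the retrieval parameter k from 5 to 10",
)
```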
Early adoption results show that these tools can significantly accelerate development cycles by eliminating guesswork and providing clear guidance on what needs fixing and exactly how to address it.
Getting Started
Both WixQA and RAGXplain are designed as practical tools that teams can integrate immediately. The WixQA datasets come with comprehensive documentation and baseline results. RAGXplain integrates into existing evaluation pipelines with minimal setup - it evaluates 200 records for under $1.00 and completes in about 38 seconds.
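The post doesn't spell out RAGXplain's API, but conceptually the integration looks like a standard LLM-as-judge loop over your evaluation records. The sketch below is our own illustration of that pattern; rag_pipeline and judge are placeholders for whatever your stack provides.

```python
# Our own sketch of an LLM-as-judge evaluation loop -- not RAGXplain's API.
# `rag_pipeline` and `judge` are placeholders for your retrieval + generation
# stack and your metric-scoring function.
from typing import Callable

METRICS = [
    "context_relevancy", "context_adherence", "answer_relevancy",
    "context_recall", "factuality", "grading_note",
]

def evaluate(records: list[dict], rag_pipeline: Callable, judge: Callable) -> list[dict]:
    results = []
    for record in records:
        retrieved, answer = rag_pipeline(record["question"])
        scores = {m: judge(m, record, retrieved, answer) for m in METRICS}
        results.append({"question": record["question"], **scores})
    return results
```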
For teams ready to improve their RAG systems, we recommend starting with the WixQA datasets to benchmark your pipeline against realistic enterprise queries, then running RAGXplain to pinpoint which components need attention and how to fix them.
With WixQA and RAGXplain, enterprise teams can finally evaluate and improve their RAG systems with confidence. The tools are ready - start building better AI applications today.
Access the research papers: WixQA and RAGXplain on arXiv. Download the datasets: WixQA on HuggingFace.

This post was written by Dvir Cohen
More of Wix Engineering's updates and insights:
Join our Telegram channel
Visit us on GitHub
Subscribe to our YouTube channel