Practical Guidance for A/B Testing

Effective A/B testing requires focusing on strategic questions, building reliable systems, and embedding learning into organizational culture. This guide shares practical lessons from our work supporting organizations as they adopt A/B testing.

Tip: Key Takeaways
  • Focus A/B testing on strategic questions that improve cost-effectiveness
  • Keep tools and systems simple, aligned with existing infrastructure
  • Start small and build capabilities iteratively through practice

Building Organizational Capacity for A/B Testing

Adopting A/B testing is not just a technical task. It is about building the mindset, processes, and capabilities that allow your organization to test, learn, and improve continuously. In our work with tech-forward organizations¹, we have learned that A/B testing works best when it focuses on strategic questions, relies on evidence-based design, is supported by reliable systems, and is embedded in a culture of continuous learning. Without these elements, even well-run tests may fail to drive meaningful program improvements.

This section offers practical guidance for teams seeking to build organizational capacities for implementing A/B tests that generate actionable insights.


Focus on Learning That Matters

A/B testing should help your team answer questions that lead to better programs or products. To do this, anchor A/B testing in your theory of change and apply a cost-effectiveness lens. Focus on testing changes that can meaningfully improve impact, not just minor user experience (UX) tweaks.

Anchor Testing in Theory of Change

Use your Theory of Change to identify:

  • Strategic levers that drive impact
  • Early outcomes that signal progress
  • Components where there is uncertainty or room for improvement

This ensures tests focus on improving pathways to impact rather than making superficial changes.

Use Cost-Effectiveness as a Lens

Prioritize tests that can improve:

  • Reach: Expanding access to target populations
  • Efficacy: Strengthening outcomes per user
  • Efficiency: Reducing costs while maintaining impact

This framework helps teams focus experimentation on changes that move the needle on program cost-effectiveness.
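
As a rough illustration of this lens, the sketch below compares a control and a variant on the extra cost per additional outcome; all figures are hypothetical placeholders, not benchmarks.

```python
# Hypothetical incremental cost-effectiveness comparison between a control
# and a variant. All numbers are illustrative placeholders.
cost_per_user_control = 1.50    # delivery cost per user, control arm
cost_per_user_variant = 1.65    # delivery cost per user, variant arm
completion_rate_control = 0.20  # share of users completing a module
completion_rate_variant = 0.24

# Incremental cost-effectiveness ratio: extra cost per additional completion.
extra_cost = cost_per_user_variant - cost_per_user_control
extra_completions = completion_rate_variant - completion_rate_control
print(f"Extra cost per additional completion: ${extra_cost / extra_completions:.2f}")
```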


A/B Testing is Not Always the Right Tool

A/B testing requires experiment infrastructure, engineering effort, and user exposure, so it’s important to use it for questions where the potential insights justify the investment. When outcomes are predictable or the expected learning value is low, other methods may be more appropriate and less costly.

Warning: When to Consider Alternatives

Imagine an EdTech platform is considering whether the interface color has an effect on student learning outcomes. In this case, running an A/B test is not the best approach. Instead, surveying users about their preferences and reviewing the existing human-computer interaction literature would be more appropriate and cost-effective ways to inform this decision before investing in an A/B test.

Consider These Alternatives

Before implementing major changes:

  • Conduct user research to understand needs and preferences
  • Review existing evidence from similar contexts
  • Run small pilots to test feasibility
  • Use prototyping to validate design concepts

For low-stakes decisions:

  • Make the change and monitor with existing data systems
  • Gather qualitative feedback from users
  • Review usage analytics for patterns

Keep Tools and Systems Simple

A/B testing works best when your tools integrate smoothly with your existing systems. Prioritize tools that work well with your current data and engineering setup. Simple integrations reduce costs and errors in data collection and analysis. Keep in mind that tools with strong documentation and support will be easier to maintain over time.

Tool Selection Criteria

When selecting experimentation tools, consider:

Compatibility

  • How well does this integrate with our current tech stack?
  • Can it work with our existing data collection systems?
  • Does it support our preferred programming languages or frameworks?

Cost-effectiveness

  • What are the total costs (licensing, hosting, maintenance)?
  • Are there open-source alternatives available?
  • What is the cost-benefit ratio given our expected testing volume?

Features and functionality

  • Does it support our required testing methodologies?
  • Can it handle our scale and user base?
  • Does it provide adequate analysis and reporting tools?

Support and documentation

  • Is documentation comprehensive and accessible?
  • What level of technical support is available?
  • Is there an active user community?

Available Tools

Consider these options:

  • Open-source non-profit solutions: Evidential, UpGrade
  • Commercial platforms: Various SaaS options with different feature sets
  • In-house development: Custom solutions for specific needs

Effective Monitoring Requires Upfront Planning

Good A/B testing depends on good data. Before launching a test, define what data will be collected and from which sources, and ensure that your data systems can support the analysis you will need.

Plan Your Data Infrastructure

Before running tests, ensure you can:

  • Capture all necessary user interactions
  • Track key outcome metrics reliably
  • Store data securely and accessibly
  • Link data across systems as needed

Define Metrics Early

Establish clear definitions for:

  • Primary outcome metrics: The main indicators of success
  • Secondary metrics: Supporting indicators that explain results
  • Safety metrics: Indicators of potential harm or system issues
  • Integrity metrics: Checks on randomization and implementation
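
One lightweight way to pin these definitions down before launch is a small, version-controlled metric registry. The sketch below is a hypothetical example; the metric names, definitions, and thresholds are placeholders to be agreed across teams.

```python
# Hypothetical metric registry for a single experiment. Names, definitions,
# and thresholds are placeholders, not recommended values.
EXPERIMENT_METRICS = {
    "primary": {
        "module_completion_rate": "completed modules / enrolled users, per arm",
    },
    "secondary": {
        "weekly_active_minutes": "median active minutes per user per week",
    },
    "safety": {
        "dropout_rate": "flag if any arm exceeds 1.1x the historical rate",
    },
    "integrity": {
        "sample_ratio": "observed vs planned allocation, checked daily",
        "exposure_coverage": "share of assigned users with an exposure event",
    },
}
```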

Design Data Structures

Create data schemas that:

  • Support the unit of analysis (user, session, etc.)
  • Include all variables needed for analysis
  • Enable efficient querying and visualization
  • Maintain data quality standards
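
A minimal sketch of what such a schema might look like for a user-level unit of analysis; the record types and field names are illustrative assumptions, not a required standard.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Assignment:
    """One row per user per experiment: who was assigned which variant, and when."""
    experiment_id: str
    user_id: str
    variant: str          # e.g., "control" or "treatment"
    assigned_at: datetime

@dataclass
class OutcomeEvent:
    """One row per outcome observation, linkable to assignments via user_id."""
    experiment_id: str
    user_id: str
    metric_name: str      # should match the agreed metric definitions
    value: float
    observed_at: datetime
```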

Align Teams on Metrics and Technical Aspects

Align the program, monitoring-evaluation-learning (MEL), and data teams on metric definitions and data needs to ensure consistent collection and valid interpretation of results. Clarify roles for data collection, monitoring, and analysis.

Cross-Team Alignment

Program teams should:

  • Define what success looks like
  • Identify priority areas for testing
  • Specify implementation constraints

MEL teams should:

  • Design valid experiments
  • Define measurement approaches
  • Ensure data quality standards

Data/tech teams should:

  • Build required infrastructure
  • Implement randomization systems
  • Create monitoring dashboards

Establish Clear Roles

Define who is responsible for:

  • Experiment design and approval
  • Technical implementation
  • Data collection and quality checks
  • Monitoring and analysis
  • Decision-making based on results

Start Small to Learn

A/B testing is a capability that grows with practice. Begin with A/A tests to verify that randomization and data flows work correctly. Run simple, low-stakes A/B tests first to build team confidence. Use early tests as learning opportunities to refine your processes.
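
As one way to support such checks, the sketch below assigns users to arms with deterministic hashing (so the same user always lands in the same arm) and runs a quick A/A-style sanity check on simulated IDs; the hashing scheme and IDs are assumptions for illustration, not a prescribed method.

```python
import hashlib

def assign_arm(user_id: str, experiment_id: str, arms=("A", "B")) -> str:
    """Deterministic assignment: the same user always gets the same arm."""
    digest = hashlib.sha256(f"{experiment_id}:{user_id}".encode()).hexdigest()
    return arms[int(digest, 16) % len(arms)]

# A/A-style sanity check on simulated users: both "variants" are identical,
# so we only verify that assignment is stable and the split is roughly even.
assignments = [assign_arm(f"user-{i}", "aa-test-001") for i in range(10_000)]
share_a = assignments.count("A") / len(assignments)
print(f"Share assigned to A: {share_a:.3f}")   # expect close to 0.5
assert assign_arm("user-42", "aa-test-001") == assign_arm("user-42", "aa-test-001")
```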

Build Capabilities Gradually

Phase 1: Infrastructure validation

  • Run A/A tests to verify systems work
  • Check data collection pipelines
  • Validate randomization procedures
  • Test monitoring dashboards

Phase 2: Simple experiments

  • Test low-risk variations
  • Focus on clear, measurable outcomes
  • Use shorter test durations
  • Limit scope to reduce complexity

Phase 3: Strategic experiments

  • Test higher-impact changes
  • Run longer, more complex tests
  • Experiment with multiple variations
  • Build sophisticated analysis capabilities

Learn from Each Cycle

After each test:

  • Document what worked well
  • Identify challenges and bottlenecks
  • Refine processes for next time
  • Share learnings across teams

Ensure Tests are Valid and Reliable

For results to be trustworthy, check that test groups are balanced in size and across key contextual variables, and verify that the intervention was delivered as intended in each group. The sketches below illustrate simple checks for each of these.

Verify Randomization

Check that:

  • Group sizes match planned allocation (e.g., 50-50 split)
  • Users were assigned randomly, not systematically
  • No user received both treatments
  • Assignment remained stable throughout the test
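
A minimal sketch of a sample-ratio check, assuming a planned 50-50 split and hypothetical group counts; it uses a chi-square goodness-of-fit test against the planned allocation.

```python
from scipy.stats import chisquare

# Observed group sizes (hypothetical) vs a planned 50-50 allocation.
observed = [10_050, 9_950]
total = sum(observed)
expected = [total * 0.5, total * 0.5]

stat, p_value = chisquare(f_obs=observed, f_exp=expected)
if p_value < 0.001:
    print(f"Possible sample ratio mismatch (p = {p_value:.2g}); investigate before analyzing.")
else:
    print(f"Group sizes are consistent with the planned split (p = {p_value:.2g}).")
```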

Confirm Group Comparability

Verify that groups are balanced on:

  • Demographics (age, gender, location, etc.)
  • Prior behavior (engagement levels, usage patterns)
  • Context variables (device type, access conditions)
  • Any other factors that might affect outcomes
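
A minimal sketch of a balance check using standardized mean differences, assuming assignments and covariates sit in one pandas DataFrame; the column names and simulated data are hypothetical.

```python
import numpy as np
import pandas as pd

def standardized_mean_difference(a: pd.Series, b: pd.Series) -> float:
    """SMD between two groups; values below roughly 0.1 are usually taken as balanced."""
    pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    return (a.mean() - b.mean()) / pooled_sd

# Hypothetical frame: one row per user, with arm and pre-assignment covariates.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "arm": rng.choice(["A", "B"], size=1_000),
    "age": rng.normal(14, 2, size=1_000),
    "prior_sessions": rng.poisson(5, size=1_000),
})

for covariate in ["age", "prior_sessions"]:
    smd = standardized_mean_difference(
        df.loc[df["arm"] == "A", covariate],
        df.loc[df["arm"] == "B", covariate],
    )
    print(f"{covariate}: SMD = {smd:.3f}")
```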

Monitor Implementation Fidelity

Ensure that:

  • Each group received the intended variation
  • No contamination occurred between groups
  • Technical delivery worked as designed
  • Timing and exposure matched plans
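
A minimal sketch of two fidelity checks on exposure logs, assuming a log with one row per exposure event; the column names and data are hypothetical.

```python
import pandas as pd

# Hypothetical exposure log: one row each time a user is shown a variant.
exposures = pd.DataFrame({
    "user_id":       ["u1", "u1", "u2", "u3", "u3"],
    "assigned_arm":  ["A",  "A",  "B",  "A",  "A"],
    "delivered_arm": ["A",  "A",  "B",  "A",  "B"],  # u3 once saw the wrong variant
})

# Check 1: flag users exposed to more than one delivered variant (contamination).
arms_per_user = exposures.groupby("user_id")["delivered_arm"].nunique()
contaminated = arms_per_user[arms_per_user > 1].index.tolist()
print(f"Users exposed to multiple variants: {contaminated}")

# Check 2: flag exposures where delivery differs from assignment.
mismatch_rate = (exposures["assigned_arm"] != exposures["delivered_arm"]).mean()
print(f"Share of exposures not matching assignment: {mismatch_rate:.1%}")
```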

Interpret Results for Practical Value, Not Just Significance

When analyzing results, focus on whether observed differences are meaningful for program or product decisions, not just whether they are statistically significant.

Look Beyond P-Values

Statistical significance tells you whether an observed effect is unlikely to be due to chance, but it doesn’t tell you:

  • If the effect size matters practically
  • If implementation costs are justified
  • If effects will persist over time
  • If results will generalize to other contexts
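
A minimal sketch of looking beyond the p-value: it reports the absolute difference in a completion rate with a 95% confidence interval and compares it to a minimum effect agreed in advance as worth acting on; all counts and thresholds are hypothetical.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical results: completions out of users per arm, and a pre-agreed
# minimum effect that would justify rolling out the change.
completions = {"control": 1_980, "variant": 2_120}
users = {"control": 10_000, "variant": 10_000}
min_effect_worth_acting_on = 0.02

p_c = completions["control"] / users["control"]
p_v = completions["variant"] / users["variant"]
diff = p_v - p_c
se = np.sqrt(p_c * (1 - p_c) / users["control"] + p_v * (1 - p_v) / users["variant"])
z = norm.ppf(0.975)
ci_low, ci_high = diff - z * se, diff + z * se

print(f"Difference in completion rate: {diff:.3f} (95% CI {ci_low:.3f} to {ci_high:.3f})")
if ci_low >= min_effect_worth_acting_on:
    print("Effect is both detectable and large enough to act on.")
else:
    print("Even if statistically significant, the effect may be too small to justify rollout.")
```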

Consider Practical Significance

Ask yourself:

  • Is this improvement worth the cost?
  • Will this change meaningfully affect user outcomes?
  • Can we maintain this intervention at scale?
  • Does this align with our strategic priorities?

Examine the Full Picture

Look at:

  • Effect sizes and confidence intervals
  • Secondary metrics and user segments
  • Implementation challenges encountered
  • Qualitative feedback from users
  • Cost-benefit considerations

Document and Share Learnings

Institutionalize a process for recording test designs, results, and key takeaways so that learnings inform future iterations and broader strategy.

Create Documentation Standards

For each experiment, document:

  • Rationale: Why this test was run
  • Hypothesis: Expected outcomes and mechanisms
  • Design: Methods, metrics, sample size
  • Implementation: What actually happened
  • Results: Findings and interpretation
  • Decisions: Actions taken based on results
  • Lessons: What was learned for next time
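
One way to keep these records consistent across experiments is a simple structured template; the sketch below is a hypothetical example, not a required format.

```python
from dataclasses import dataclass, field

@dataclass
class ExperimentRecord:
    """Hypothetical template mirroring the documentation fields above."""
    experiment_id: str
    rationale: str         # why this test was run
    hypothesis: str        # expected outcomes and mechanisms
    design: str            # methods, metrics, sample size
    implementation: str    # what actually happened, including deviations
    results: str           # findings and interpretation
    decisions: str         # actions taken based on results
    lessons: list[str] = field(default_factory=list)
```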

Share Knowledge

Make documentation:

  • Accessible: Store in shared, searchable locations
  • Actionable: Include clear recommendations
  • Comprehensive: Capture both successes and failures
  • Connected: Link to broader strategic goals

Build Institutional Memory

Use documentation to:

  • Train new team members
  • Inform future experiments
  • Track cumulative learning
  • Share insights across programs
  • Contribute to the broader evidence base

Moving Forward

A/B testing is a powerful method when applied thoughtfully and with a clear learning agenda. Success depends on selecting the right questions, building reliable data systems, ensuring data quality, and embedding learning into decision-making.

Organizations should view A/B testing as a capability to be developed iteratively. By starting small, building alignment across teams, and investing in systems and structured processes, organizations can turn A/B testing into a sustainable driver of program and product improvement. With practice, A/B testing can become a key part of how organizations deliver greater impact.



  1. These learnings have emerged through work with partner organizations, Mentu and Tirando x Colombia, during the development of their A/B testing systems.
