Multi-role E2E test architecture for Ideony — deep research
Date: 2026-04-19
Status: research-only, no implementation
Feeds into: future spec for Phase E multi-role test harness
Authors: Claude (Opus 4.7, deep-research mode)
Executive summary
Ideony is a two-sided marketplace with real-time coordination semantics (SOS dispatch + booking state machine + chat + GPS tracking). Single-actor E2E tools (Maestro, Playwright, Detox) cover individual flows well but leave the hardest coordination bugs — race conditions on concurrent `UPDATE booking SET status=…`, WebSocket delivery guarantees, cascade dispatch winner-selection, GPS staleness, push-notification ordering — completely untested.
After auditing how Uber, Airbnb, DoorDash, Bolt, Glovo, and Amazon handle this problem and comparing every commercial AI E2E tool against an OSS-custom approach, this report recommends:
- Build, don’t buy. No commercial tool (QAWolf, Momentic, TestRigor, Reflect, Mabl) can orchestrate two Expo clients + shared BE state + WebSocket assertions + GPS streaming on CI at a price Ideony can absorb pre-revenue. They are all built for single-user web flows; multi-role is either absent or emulated via brittle multi-tab hacks.
- Pattern: Maestro per device + Playwright for web + thin Node.js orchestrator over Postgres + Redis as coordination substrate. This mirrors DoorDash’s multi-tenancy model, Uber’s Composable Testing Framework (CTF), and Bolt’s simulator — at SaaS-startup scale.
- State model is the hard part, not the runner. Every SOTA company converged on a centralized, introspectable test state (Uber CTF trip object, DoorDash “DoorTest” tenant, Bolt city simulator ticks). Ideony must expose a `/test/state` endpoint and deterministic seed+freeze controls before any harness is worth writing.
- Layer the testing pyramid. 60% of “multi-role” bugs can be caught with WebSocket integration tests using two `socket.io-client` instances against a real NestJS gateway — no mobile runner, 100× faster, deterministic. Reserve full orchestrated multi-device runs for the 6 canonical scenarios.
Budget estimate: 3-4 weeks to build, ~€0/month ongoing (self-hosted ARM64 runner already exists). Alternative (QAWolf managed): ~€8-15k/yr + still can’t do multi-device orchestration without custom glue. OSS route wins on both cost and capability.
Section 1 — SOTA pattern deep-dive
1.1 Uber — Composable Testing Framework (CTF) + BITS
Source of truth: Uber Blog, “Shifting E2E Testing Left at Uber” (2024-08-22) + DPE Summit 2024 talk by Daniel Tsui & Quess Liu + Signadot deep-dive.
Stack
- Composable Testing Framework (CTF) — internal code-level DSL, JVM-based, where every test action is a pure function over a centralized trip state object. E.g. `rider.requestRide()`, `driver.accept()`, `driver.arrive()` — each mutates a shared `TripState` snapshot.
- BITS (Backend Integration Testing Strategy) — Cadence-orchestrated workflow engine that provisions ephemeral sandboxes that route test traffic through production services via OpenTelemetry baggage-based context propagation. Tests run against real prod downstreams, but side-effects are scoped to `test-account` tenancy.
- Cadence (now Temporal) for sandbox lifecycle
- Jaeger trace indices to measure endpoint coverage per test
Multi-actor pattern
The forward dispatch example in their blog is the canonical reference: a driver receives a new pickup before completing the prior drop. CTF represents both actors as operations on shared trip state. The framework ensures serialization — `driver.accept()` cannot run until `rider.requestRide()` emits the matching event.
Shared state / fixtures
Test tenancy propagated via OpenTelemetry baggage → all Kafka topics, DB writes, RPC calls carry tenant=test-acc-123 → real production routing infrastructure reroutes those to sandboxed data stores. No DB seeding per se; production is the seed.
WebSocket / real-time
Covered by context propagation — dispatch events carry the tenant tag, dispatch service routes them to the sandbox WS gateway. Test asserts on trip state after the event, not on raw WS frames.
GPS simulation
Internal “rider/driver simulator” — proprietary, not open-sourced. Drives location updates via a mocked GPS feed into the dispatch engine.
Push notifications
Captured via sandboxed push bus (APNS/FCM test accounts); assertion looks for push payload delivery to the test tenant’s notification queue.
Visual regression
Not primary — Uber’s testing is backend-heavy. Mobile visual testing uses standard Espresso/XCUITest snapshots.
Cost / CI time
Several thousand E2E tests run pre-merge. Individual test runtime amortized via Cadence parallelism. Pass rate 90%+ per attempt, boosted to 99.9% via retry. Reduced incidents/1000 diffs by 71% in 2023.
Takeaway for Ideony
Uber’s pattern is a microservice-scale solution; Ideony is a monolith. What transfers: the state-first test DSL idea. Write tests as actions-on-booking-state, not as UI click sequences. Ideony’s equivalent of TripState is BookingState + SOSDispatchState.
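What “actions-on-booking-state” could look like for Ideony, as a minimal sketch: all names here (`BookingState`, `consumer.requestBooking`, the transition table) are hypothetical, and a real DSL would drive the backend rather than an in-memory object.

```typescript
// Hypothetical state-first test DSL: each actor action is a pure function
// over a shared BookingState snapshot, mirroring Uber CTF's TripState idea.
type BookingStatus = 'CREATED' | 'PENDING_ACCEPTANCE' | 'ACCEPTED' | 'COMPLETED';

interface BookingState {
  status: BookingStatus;
  consumerId?: string;
  proId?: string;
}

// Allowed transitions; actions validate against this before mutating.
const transitions: Record<BookingStatus, BookingStatus[]> = {
  CREATED: ['PENDING_ACCEPTANCE'],
  PENDING_ACCEPTANCE: ['ACCEPTED'],
  ACCEPTED: ['COMPLETED'],
  COMPLETED: [],
};

function transition(state: BookingState, next: BookingStatus): BookingState {
  if (!transitions[state.status].includes(next)) {
    throw new Error(`illegal transition ${state.status} -> ${next}`);
  }
  return { ...state, status: next };
}

// Actor verbs compose into readable scenarios.
const consumer = {
  requestBooking: (s: BookingState, consumerId: string): BookingState =>
    transition({ ...s, consumerId }, 'PENDING_ACCEPTANCE'),
};
const pro = {
  accept: (s: BookingState, proId: string): BookingState =>
    transition({ ...s, proId }, 'ACCEPTED'),
  complete: (s: BookingState): BookingState => transition(s, 'COMPLETED'),
};

// A scenario reads as actions-on-state, not UI clicks:
let state: BookingState = { status: 'CREATED' };
state = consumer.requestBooking(state, 'test-c-1');
state = pro.accept(state, 'test-p-1');
state = pro.complete(state);
console.log(state.status); // COMPLETED
```

The transition table doubles as documentation of the booking state machine; an illegal ordering (e.g. completing before accepting) fails fast in the test itself.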
1.2 Airbnb — Cypress + Happo + Ruby integration + one-shard-per-PR
Sources: Happo.io case studies, Better Android Testing at Airbnb - Part 2 by Eli Hart, multiple Airbnb Tech Blog posts.
Stack
- Cypress for web E2E
- Happo (external SaaS) for visual regression — screenshots uploaded async, keyed by git SHA, PR-blocking diff review
- Espresso + Happo for Android; approach: render mock variants × screens → Happo diff
- Internal “Happyhost” simulator (mentioned in various talks) — simulated backend for deterministic E2E seeding
- Jest for unit
Multi-actor pattern
Airbnb has a lighter multi-actor requirement than Uber — host-vs-guest interactions are mostly async messaging, not real-time. Their pattern uses two Cypress browser contexts (via cy.session() with different cookie jars) + DB fixture seeding. One test file, two personas, synchronization via polling a shared mock inbox.
Shared state
Database fixtures — Rails+Postgres seed snapshots keyed to test scenarios. Each test resets to a known state.
WebSocket / real-time
Minimal — messaging is push/email primarily. Where WS exists (inbox updates), tested at the service level via RSpec + stub adapters.
Visual regression
Happo is the crown jewel. Async bitmap upload + cross-browser (Chrome, Firefox, Safari, Edge, iOS Safari) parallelization. Eli Hart’s Android article describes how every mock variant × every screen = a screenshot. PR build posts a diff comment; reviewers approve visual changes as part of code review.
CI time / cost
Happo priced competitively against Percy (cheaper, per their marketing; exact pricing gated behind sales). Cypress shards per PR on internal infra.
Takeaway for Ideony
Ideony already uses Lost Pixel for visual regression — the correct pattern, via a cheaper alternative. The two-Cypress-contexts pattern doesn’t transfer directly (Ideony is mobile-first), but the persona seed + polling sync approach does transfer to two-device Maestro orchestration.
1.3 DoorDash — Multi-tenancy in production + “DoorTest” guardrails
Source: Moving e2e testing into production with multi-tenancy + Drive Delivery Simulator docs.
Stack
- Kotlin backend + Kafka
- Multi-tenant gRPC interceptor model — every request carries a tenant header; test tenants live alongside prod tenants
- Internal UI tool + gRPC service for devs to spawn test users (consumer, dasher), simulate geolocation, create test stores, assign test orders
- Delivery Simulator — public dev portal tool that advances an order through states (Created → Dasher Confirmed → Arrived at Pickup → Picked Up → Arrived at Dropoff → Delivered) without dispatching real dashers
Multi-actor pattern
Test consumer places an order at a test store (real stores don’t accept test orders; test stores don’t accept real orders — enforced by tenant guardrail). Test dasher picks up. All state transitions exercise the real dispatcher, payment, and notification services, but side-effects (money, SMS, actual deliveries) are routed to no-op sinks.
Shared state
No seeding — production is the environment. Guardrails enforce isolation. Test scenarios are reliably reproducible because the tooling programmatically creates test users, addresses, stores.
WebSocket / real-time
Driver app receives dispatch via the same real-time channel; the tenancy header determines routing. Tests assert on state-machine transitions, not frame-level WS.
GPS
Test-user address simulation baked into the internal tool — set lat/long per user; the dispatcher uses it for matching.
Takeaway for Ideony
Multi-tenancy in production is overkill for Ideony’s stage. But the DoorTest tooling idea is directly applicable: build a `/test/scenarios` admin API that seeds a named scenario (`"SOS_BURST_PIPE_ROME"`) → pre-created consumer + 3 pros within 10km + fake booking state. A multi-role test just calls `POST /test/scenarios/sos_burst_pipe_rome` then drives two actors through the flow.
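A sketch of what such a scenario factory could compute before persisting rows. The IDs, coordinates, and `seedSosBurstPipeRome` name are hypothetical; a real endpoint would write these rows through the ORM and stay idempotent per scenario name.

```typescript
// Hypothetical scenario factory: deterministic fixtures for a named scenario,
// built in memory here instead of persisted.
interface GeoPoint { lat: number; lng: number; }
interface ScenarioFixture {
  consumer: { id: string; location: GeoPoint };
  pros: { id: string; location: GeoPoint }[];
  booking: { status: 'CREATED' };
}

// Offset a point roughly `km` kilometers east (1 deg lng ≈ 111.32 km × cos(lat)).
function offsetEastKm(p: GeoPoint, km: number): GeoPoint {
  return { lat: p.lat, lng: p.lng + km / (111.32 * Math.cos((p.lat * Math.PI) / 180)) };
}

function seedSosBurstPipeRome(): ScenarioFixture {
  const rome: GeoPoint = { lat: 41.9028, lng: 12.4964 }; // central Rome (assumed anchor)
  return {
    consumer: { id: 'test-c-1', location: rome },
    // Three pros at ~2, ~5, and ~8 km: all inside the 10 km dispatch radius.
    pros: [2, 5, 8].map((km, i) => ({
      id: `test-p-${i + 1}`,
      location: offsetEastKm(rome, km),
    })),
    booking: { status: 'CREATED' },
  };
}

// Haversine distance in km, used to verify the fixture respects the radius.
function distanceKm(a: GeoPoint, b: GeoPoint): number {
  const R = 6371;
  const dLat = ((b.lat - a.lat) * Math.PI) / 180;
  const dLng = ((b.lng - a.lng) * Math.PI) / 180;
  const h =
    Math.sin(dLat / 2) ** 2 +
    Math.cos((a.lat * Math.PI) / 180) * Math.cos((b.lat * Math.PI) / 180) * Math.sin(dLng / 2) ** 2;
  return 2 * R * Math.asin(Math.sqrt(h));
}

const fixture = seedSosBurstPipeRome();
const distances = fixture.pros.map((p) => distanceKm(fixture.consumer.location, p.location));
console.log(distances.map((d) => d.toFixed(1))); // all under 10 km
```

Making the factory self-verifying (the haversine check) is the point: the fixture encodes the dispatch-radius invariant instead of relying on hand-picked coordinates.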
1.4 Bolt — City simulator + SimPy-style event loop
Source: Simulating cities for a better ride-hailing experience at Bolt (Aug 2019, still referenced).
Stack
- Python-based city simulator
- OSRM for map routing
- Trained ETA + matching probability models
- Event-driven tick loop (simulated heartbeat)
- Real backend matching/pricing/dispatching algorithms plugged in as black boxes
Multi-actor pattern
Not strictly E2E — Bolt’s simulator is algorithm validation more than product validation. Generates N virtual riders + N drivers per city, runs them through a day’s worth of events, measures aggregate KPIs (avg pickup time, cancellation rate, utilization). Before deploying a new matching algorithm, run it through the sim; compare to baseline.
Shared state
Agents (riders, drivers) are in-process Python objects. Historical order data seeds realistic arrival rates.
Real-time assertions
None — the simulator is offline/batch. Real-time E2E happens separately via Maestro + Appium for UI flows.
Takeaway for Ideony
Pre-launch Ideony doesn’t have historical data yet — a city simulator is premature. But the event-driven tick loop is the right abstraction once volume exists. Post-v1, a simulator of 50 pros + 20 consumers in Rome could stress-test the SOS cascade dispatch algorithm. Mark as Phase F+ (not MVP, not Phase E).
1.5 Glovo — Jarvis + SimPy simulator + “teswiz”-style multi-platform E2E
Sources: How to Simulate a Global Delivery Platform, Glovo career pages citing Appium + Kotlin.
Essentially identical to Bolt: Python SimPy simulator for algorithm research, separate Appium + Kotlin E2E pipeline for product validation. Jarvis (their dispatcher) is tested in the sim using event-based simulation with trained probability models.
Takeaway for Ideony
Same as Bolt — defer the algorithm simulator. Adopt the Glovo hiring-page stack (Appium + OOP test framework + Grafana/Sentry logs) as a floor target, but substitute Maestro for Appium (equivalent for our purposes, much lighter setup).
1.6 Amazon — Selenium Grid + synthetic canaries + infrequent multi-actor
Sources: Balancing the Test Pyramid the AWS way + AWS Builder Library (public re:Invent talks).
Stack
- Selenium Grid + Appium for UI
- Synthetic canaries (CloudWatch Synthetics or internal equivalent) running in production every minute
- Heavy reliance on pre-prod canary deploys — code is tested in prod with tiny blast radius rather than in staging
Multi-actor pattern
Rare. Seller-buyer interactions are typically async (seller lists, buyer buys days later). When needed, separate test envs per persona with pre-loaded state rather than live orchestration.
Takeaway for Ideony
Synthetic canaries in prod post-launch = excellent. Run the booking-completion scenario every 5 minutes against prod (w/ tenant=synthetic) → monitor success rate. This is a post-v1 concern, not Phase E.
1.7 Cross-cutting pattern summary
Distilled across all six:
| Dimension | Dominant pattern |
|---|---|
| Runner | Whatever single-user runner matches the platform (Cypress for web, Espresso/XCUITest for native, Appium/Maestro for React Native) |
| Orchestration | Node.js or Kotlin or Python glue script in CI |
| Shared state | Centralized mutable test-state object exposed via API OR multi-tenant production routing |
| Real-time sync | Poll /test/state OR subscribe to a tap on the event bus (Redis pub/sub, Kafka topic) |
| GPS | Simulator injects lat/long through the real dispatch service; never through mobile OS APIs at E2E layer |
| Visual | Async screenshot SaaS (Happo, Percy, Chromatic, Applitools, Lost Pixel) keyed by SHA |
| Fault injection | Separate chaos runs, not interleaved with functional E2E |
Nobody uses AI-authored tools at SOTA scale for multi-actor flows. The AI tools market is aimed at QA-team-light startups that want to replace manual testers, not at engineering-driven orgs that already write test code.
Section 2 — AI-powered E2E tools comparison
2.1 QAWolf
Sources: qawolf.com, Bug0 review, QAWolf pay-per-test pricing post (2026-01-19).
- Model: Managed service. Humans + AI author Playwright/Appium tests. Charged per test per month (undisclosed per-test price, anchored around “roughly half an in-house QA engineer”). Example cited: 400-800 tests total per mid-size app.
- Multi-user: Playwright supports it via multiple browser contexts; QAWolf can author such tests. Real-time multi-mobile orchestration: no explicit support.
- Mobile: Web + iOS + Android via Appium; no React Native specialization.
- CI: GitHub Actions supported. Webhook on deploy triggers suite.
- SOTA verdict: Would Airbnb use it? No — Airbnb has its own test engineers. Would a 5-person YC startup use it? Yes, to offload QA entirely.
- Ideony fit: 4/10. Cost unknown but likely €500-2000/mo given “half a QA engineer” anchor. Covers single-user flows well. For SOS dispatch / real-time / GPS / WebSocket assertions — would need custom Playwright code they author on your behalf, losing the “managed” value prop. Lock-in risk: tests run on their infra.
2.2 Momentic
Sources: momentic.ai/enterprise, trendingaitools.com Momentic review, Bug0 Momentic review.
- Pricing (2026-04): Starter free (50 runs/mo, 1 env); Pro $99/mo (1000 runs/mo); Business custom. $15M Series A by Standard Capital (recent).
- Multi-user: Web only. No multi-mobile orchestration. Chrome extension records flows → AI generates self-healing selectors. Cannot coordinate two simultaneous contexts natively; user has to script that themselves.
- Mobile: Web-only as of this writing (flagged in reviews; “mobile support pending”).
- CI: CI/CD webhooks, GitHub Actions. 99.99% uptime SLA, SOC2 Type 2.
- Verdict for Ideony: 3/10. Disqualified by being web-only. Ideony is Expo-first; web is a secondary target. Even for the web preview, multi-role coordination is DIY.
2.3 TestRigor
Sources: testRigor FAQ, stackpick pricing.
- Pricing: Free for public tests; $300/mo private (lowest paid tier). Pricing scales with parallelization units, not tests-per-month (favorable for large suites, unfavorable for small ones).
- Multi-user: Explicitly advertised. FAQ mentions “multiple users to interact via email, sms, or instant messages.” Plain-English test DSL: `"login as user1"`, `"send message 'hello' to user2"`, `"verify user2 receives 'hello'"`.
- Mobile: Web + mobile via internal runners. Native iOS/Android less mature than web.
- CI: REST API trigger from GitHub Actions.
- Verdict for Ideony: 5/10. Plain-English spec is attractive for a non-technical cofounder to author tests. But $300+/mo + black-box infrastructure + unknown maturity of the multi-role feature for our specific (Expo + WebSocket + GPS) shape makes it risky. Would require a paid pilot before committing. Lock-in is high (tests are in testRigor’s DSL, not portable).
2.4 Applitools Eyes
Source: applitools.com/pricing, Visual Sentinel 2026 comparison.
- Pricing: Free 100 checkpoints/mo; paid starts ~$899/mo (cited for 1000 checkpoints). No public mid-tier pricing.
- Not a functional E2E tool. Adds visual + Ultrafast Grid (cross-browser) on top of Playwright/Cypress/Appium.
- Multi-user: N/A — it’s a visual layer, orthogonal to orchestration.
- Verdict for Ideony: 2/10 as a multi-role solution (not its purpose). 6/10 as a potential Lost Pixel upgrade if visual complexity grows. Ideony already uses Lost Pixel (free, self-hosted); no reason to switch pre-revenue.
2.5 Mabl
Sources: saascounter.net pricing survey, vendor comparisons.
- Pricing: ~$250-450/mo starting (low-code, AI-assisted). Not self-serve — requires sales call.
- Multi-user: Limited; primarily single-flow low-code.
- Mobile: Web + limited mobile.
- Verdict: 3/10. Enterprise-oriented, pricing opaque, no natural fit for mobile-first real-time product.
2.6 Reflect.run
Source: reflect.run/pricing.
- Pricing: Team $225/mo (web+API, 500 credits/mo), Premium contact sales. Mobile testing is a paid add-on, “private mobile” tier is Enterprise.
- Multi-user: Documented only as separate tests; no orchestration primitives.
- Verdict: 3/10. Similar profile to Mabl — web-first, mobile bolted on.
2.7 Checkly
Source: scanlyapp 2026 Checkly alternatives.
- Pricing: Free (10k API runs/mo); Team $64/mo (100k API runs, 12k browser runs).
- Positioning: Monitoring-as-code, not E2E test authoring. Playwright scripts in git, run on schedule against prod.
- Multi-user: Playwright’s multi-context capability available since tests are raw Playwright code.
- Verdict for Ideony: 7/10 as a post-launch synthetic monitoring solution — run the booking happy-path every 5 minutes in prod, alert on failure. Not a multi-role development tool but a complement.
2.8 Autify
Public info: web + native app via AI-assisted test recorder, $99/mo starter, $450/mo Pro. No meaningful multi-role support beyond single-user flows.
Verdict: 3/10.
2.9 Functionize, DevCycle
- Functionize: Enterprise AI testing, no public pricing, no multi-role emphasis. 2/10.
- DevCycle: Feature flag platform — not an E2E tool. Mis-categorized in the brief. Could be used for test-gating (flag-on-for-test-tenant) but that’s orthogonal. N/A.
2.10 Ranking for Ideony (1-10)
| Tool | Ideony score | Reasoning |
|---|---|---|
| Checkly | 7 | Great for post-launch synthetic monitoring; not primary E2E tool |
| TestRigor | 5 | Explicit multi-user support, but opaque infra + $300/mo + lock-in |
| QAWolf | 4 | Managed quality but cost unknown, multi-mobile not their strength |
| Applitools | 3 | Orthogonal (visual only); already have Lost Pixel |
| Mabl / Reflect / Autify | 3 | Web-biased, mobile bolted on |
| Momentic | 3 | Web-only; mobile promised-but-absent |
| Functionize | 2 | Enterprise-only, no pricing transparency |
None are >7. No AI tool offers a compelling out-of-the-box solution for Ideony’s specific cocktail: Expo (iOS+Android+Web from same codebase) + Socket.IO gateway + GPS streaming + two-actor coordination + Italian-first locale.
Section 3 — Build-vs-buy matrix
3.1 Ideony’s requirements
- 2 mobile clients (consumer + pro) ± optional 3rd (admin web)
- Real BE (NestJS + Postgres + Redis + Socket.IO), only Stripe/Clerk/Novu in test mode
- Deterministic state seeds via existing `pnpm seed:demo` + future `/test/scenarios/:name`
- 6 canonical scenarios (defined in `project_multi_role_e2e.md`)
- IT + EN locales
- Expo (iOS/Android/Web) from single codebase
- GPS streaming (SOS dispatch)
- WebSocket delivery assertions (chat, dispatch, tracking)
- Push-notification timing (Novu → Resend/Twilio/Expo Push)
3.2 Decision matrix
| Dimension | OSS custom orchestrator | TestRigor | QAWolf managed | Mabl / Autify |
|---|---|---|---|---|
| Initial eng cost | 3-4 wk senior eng | 1 wk integration + learning curve | 1 wk onboarding | 1 wk onboarding |
| Monthly cost | €0 (self-hosted ARM64 runner exists) | €300-900 | €500-2000 est. | €250-450 |
| Annual cost (yr 1) | ~€0 + time sunk | ~€5-11k | ~€6-24k | ~€3-5.5k |
| Multi-mobile orchestration | Full control | Limited | Limited | Web-biased |
| GPS simulation | Full control via backend API | Blocked (no runner access) | Via custom Playwright | Via custom Playwright |
| WebSocket assertion | Direct socket.io-client | Via their DSL (uncertain fidelity) | Via custom Playwright | Limited |
| Debuggability | 100% (our code, our logs) | Dashboard-based | Dashboard-based | Dashboard-based |
| Lock-in | None | High (their DSL) | Medium (Playwright artifacts exported) | High |
| Adding 7th scenario | ~1 day | ~1 day in DSL | Request new test (billed) | ~1 day |
| Skill on team | TS, Node, Socket.IO — already have | Plain English — anyone | None needed | None needed |
| Pre-revenue startup fit | Excellent | Marginal | Poor | Poor |
3.3 Recommendation
Build. Rationale:
- Ideony already has the infrastructure: Maestro (free tier, ARM64 runner), Playwright installed, a NestJS backend where `/test/*` endpoints are possible, Socket.IO gateway already instrumented.
- The 3-4 week eng cost front-loads cleanly into Phase E; AI tools shift cost to monthly burn without fully solving the multi-role problem.
- Debuggability matters more than anything when multi-role tests fail at 2am — our code, our logs, our traces beat a black-box SaaS dashboard every time.
- Exit cost: if in 2 years we want to migrate to QAWolf, our Maestro flows and Playwright specs are portable (they’re the same artifacts QAWolf would author).
Caveat: If cofounders want cofounder-level QA (i.e. non-technical PM writes tests), TestRigor becomes more attractive. But that’s a product-org decision, not a technical one.
Section 4 — Recommended architecture
4.1 Stack
| Layer | Tool | Version | Role |
|---|---|---|---|
| Mobile single-user runner | Maestro | 2.4.0+ (CLI) | Per-device YAML flows, 122 flows exist |
| Web single-user runner | Playwright | @playwright/test 1.48+ | Web E2E, browser contexts for secondary multi-user |
| Native iOS-specific fallback | Detox | 20.x | Only where Maestro can’t reach (rare) |
| Orchestrator | Node.js + TypeScript + Vitest | Node 22 LTS + Vitest 2.x | Multi-role coordinator |
| Event bus (test) | Redis pub/sub | 8.6 (existing) | Sync primitive between runners |
| WebSocket client (integration layer) | socket.io-client | 4.8+ | Direct multi-client assertions |
| State control | NestJS /test/scenarios module | New, gated by NODE_ENV=test + auth header | Deterministic seeds |
| Visual | Lost Pixel (existing) | — | No change |
| GPS simulation | Custom via BE /test/geo-feed endpoint | New | Inject GPS into dispatch service, not into mobile OS |
| CI | GitHub Actions + self-hosted ARM64 runner | existing | Orchestrates all above |
| Observability | Sentry (BE) + run artifacts (screenshots, videos, WS transcripts) | existing Sentry + new artifact bundler | Failure triage |
4.2 Harness design
Layered pyramid (bottom = fastest, broadest; top = slowest, narrowest):
```
        +---------------------------+
        | Multi-device Maestro      |  <-- 6 canonical scenarios
        | orchestrated (CI only)    |      ~2-5 min each
        +---------------------------+
       +-----------------------------+
       | Playwright 2-context web    |  <-- subset for web
       | multi-role (local+CI)       |      ~30s each
       +-----------------------------+
    +-----------------------------------+
    | Multi-client socket.io-client     |  <-- bulk of coordination
    | integration tests (no UI)         |      ~1-3s each
    +-----------------------------------+
  +---------------------------------------+
  | Single-service NestJS unit + e2e      |  <-- existing 300+ BE tests
  +---------------------------------------+
```

Crucially: 80% of multi-role bugs can be caught at the “multi-client socket.io-client” layer without a single mobile runner. Two node processes, real BE, real Postgres, real Redis — assert on WS frames + DB state. Only the top 2 layers need the heavy orchestration.
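To make that bottom-heavy layer concrete, here is a dependency-free sketch of the two-client wait-and-assert pattern. `FakeGateway`, `waitForEvent`, and `demo` are hypothetical stand-ins built on Node’s `EventEmitter` so the shape runs anywhere; real tests would point two `socket.io-client` instances at the running NestJS gateway.

```typescript
import { EventEmitter } from 'node:events';

// In-process stand-in for the Socket.IO gateway (hypothetical; for shape only).
class FakeGateway {
  private sockets = new Map<string, EventEmitter>();

  connect(userId: string): EventEmitter {
    const sock = new EventEmitter();
    this.sockets.set(userId, sock);
    return sock;
  }

  // Server-side broadcast of an event to one connected user.
  emitTo(userId: string, event: string, payload: unknown): void {
    this.sockets.get(userId)?.emit(event, payload);
  }
}

// Await a single named event with a timeout: the core assertion helper.
function waitForEvent<T>(sock: EventEmitter, event: string, timeoutMs = 1000): Promise<T> {
  return new Promise((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error(`timeout waiting for ${event}`)), timeoutMs);
    sock.once(event, (payload: T) => {
      clearTimeout(timer);
      resolve(payload);
    });
  });
}

// Two clients, one gateway: a consumer-side action triggers a pro-side event.
async function demo(): Promise<string> {
  const gw = new FakeGateway();
  gw.connect('test-c-1'); // consumer socket (unused in this particular assertion)
  const proSock = gw.connect('test-p-1');

  const received = waitForEvent<{ bookingId: string }>(proSock, 'booking:new');
  gw.emitTo('test-p-1', 'booking:new', { bookingId: 'bk-1' }); // simulate dispatch
  return (await received).bookingId;
}

demo().then((id) => console.log(id)); // bk-1
```

The same `waitForEvent` shape carries over unchanged to real sockets, since `socket.io-client` sockets are also event emitters.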
4.3 Shared state strategy
Three seed mechanisms, used in combination:
1. `pnpm seed:demo` — persistent demo data (existing). Baseline: 20 pros, 5 consumers, Rome-Milan-Turin, realistic distribution. Run once per CI job startup.
2. `POST /test/scenarios/:name` — scenario-specific overlay on the baseline. E.g. `sos_burst_pipe_rome` creates consumer `test-c-1`, pros `test-p-1/2/3` each at a specific lat/long within 10km of the consumer, and a pending booking in state `CREATED`. Idempotent; wipes prior test-scoped rows for that name.
3. `POST /test/cleanup` — end-of-test sweep. Deletes rows tagged with `test_tenant=<uuid>`; each multi-role test generates a fresh tenant ID at start.
DB isolation: all test rows get a `test_tenant` column (nullable for prod data). Middleware adds `WHERE test_tenant = $1 OR test_tenant IS NULL` to read queries during tests. Hard guardrail: deletes under a test tenant require a matching tenant header.
Clock control: `POST /test/clock/advance?seconds=300` — advances a centralized mocked clock (a `ClockService.now()` abstraction already partially exists for booking reminders). Avoids real 5-minute waits in SOS countdown tests.
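A minimal sketch of the `ClockService` shape this implies; the `TEST_MODE` guard and method names are assumptions, not the existing partial implementation.

```typescript
// Hypothetical ClockService: production code calls now() instead of Date.now(),
// so tests can jump time forward without real waits.
class ClockService {
  private offsetMs = 0;

  now(): Date {
    return new Date(Date.now() + this.offsetMs);
  }

  // Test-only: backs POST /test/clock/advance?seconds=N (guard name assumed).
  advance(seconds: number): void {
    if (process.env.TEST_MODE !== '1') throw new Error('advance() is test-only');
    this.offsetMs += seconds * 1000;
  }
}

// Usage: check a 5-minute SOS countdown without sleeping 5 minutes.
process.env.TEST_MODE = '1';
const clock = new ClockService();
const deadline = new Date(clock.now().getTime() + 5 * 60 * 1000);
clock.advance(300); // jump 5 minutes
console.log(clock.now() >= deadline); // true
```

The offset approach keeps wall-clock monotonicity intact, so timers and logs stay ordered even after a jump.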
Deterministic randomness: seeded RNG exposed via `CryptoService`. Test mode sets a known seed via `POST /test/seed`.
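One way to sketch the seeded-RNG side: mulberry32 is a well-known tiny PRNG; its use here, and the `CryptoService` wiring implied by the comment, are assumptions.

```typescript
// mulberry32: tiny deterministic PRNG. A CryptoService test mode could return
// this instead of crypto randomness once POST /test/seed sets a known seed.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296; // uniform in [0, 1)
  };
}

// Same seed → same sequence → reproducible "random" choices in tests.
const a = mulberry32(42);
const b = mulberry32(42);
console.log(a() === b() && a() === b()); // true
```

Any branch that samples randomness (e.g. tie-breaking between equally-ranked pros) then replays identically across CI runs.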
4.4 Synchronization primitive
The critical design choice. Options considered:
- Polling REST — test A polls `GET /test/state/booking/:id` until `status=ACCEPTED`. Simple, reliable, slow (1s granularity). Chosen for happy-path sync.
- Redis pub/sub tap — test harness subscribes to the `events:bookings:*` channel, awaits a specific message. Low-latency (<50ms), deterministic. Chosen for timing-sensitive sync (GPS tracking, chat delivery).
- Shared WS bus — test spawns its own socket.io-client, subscribes to dispatch events. Chosen for assertions about what pros receive (multi-client WS test layer).
Rule: polling for state convergence; pub/sub tap for event-fired assertions.
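The polling half of that rule can be sketched as one generic helper (a hypothetical `lib/sync.ts` shape; the fake `fetchBooking` stands in for `GET /test/state/booking/:id`).

```typescript
// Generic polling helper: re-evaluate a getter until the predicate holds
// or the timeout elapses — "polling for state convergence".
async function waitForState<T>(
  get: () => Promise<T>,
  predicate: (value: T) => boolean,
  { timeoutMs = 10_000, intervalMs = 1_000 } = {},
): Promise<T> {
  const deadline = Date.now() + timeoutMs;
  let last: T | undefined;
  do {
    last = await get();
    if (predicate(last)) return last;
    await new Promise((r) => setTimeout(r, intervalMs));
  } while (Date.now() < deadline);
  throw new Error(
    `waitForState timed out after ${timeoutMs}ms; last value: ${JSON.stringify(last)}`,
  );
}

// Usage sketch: poll a fake booking endpoint until it reaches ACCEPTED.
const statuses = ['PENDING_ACCEPTANCE', 'PENDING_ACCEPTANCE', 'ACCEPTED'];
const fetchBooking = async () => ({ status: statuses.shift() ?? 'ACCEPTED' });

waitForState(fetchBooking, (b) => b.status === 'ACCEPTED', { intervalMs: 10 })
  .then((b) => console.log(b.status)); // prints ACCEPTED
```

Including the last observed value in the timeout error is what makes 2am triage tolerable: the failure message says what the state actually was, not just that it never converged.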
4.5 Sample scenario walkthrough — Scenario 1
“Consumer search → pro match → book → pay → complete”
File layout:
```
test/e2e-multi-role/
  scenarios/
    01-booking-full-cycle/
      scenario.spec.ts      <-- orchestrator entry
      consumer.flow.yaml    <-- Maestro flow for consumer app
      pro.flow.yaml         <-- Maestro flow for pro app
      assertions.ts         <-- shared DB + WS assertions
      fixtures/
        scenario-seed.json  <-- extra seed overrides
  lib/
    orchestrator.ts  <-- spawns + coordinates Maestro processes
    test-api.ts      <-- wraps /test/* endpoints
    ws-tap.ts        <-- Redis + socket.io listener helpers
    sync.ts          <-- waitForState, waitForEvent helpers
    artifacts.ts     <-- collects screenshots, logs, ws-transcripts on failure
```

Shape of `scenario.spec.ts` (illustrative, not runnable):
```typescript
import { describe, it, beforeAll, afterEach, expect } from 'vitest';
import { Orchestrator } from '../../lib/orchestrator';
import { TestApi } from '../../lib/test-api';
import { waitForBookingStatus, waitForWsEvent } from '../../lib/sync';

describe('Scenario 01: Booking full cycle', () => {
  let orch: Orchestrator;
  let api: TestApi;
  let tenantId: string;

  beforeAll(async () => {
    api = new TestApi();
    tenantId = await api.createTenant();
    await api.seedScenario('booking_full_cycle_rome', tenantId);
    orch = new Orchestrator({
      consumer: { device: process.env.CONSUMER_DEVICE_UDID!, appId: 'app.ideony.consumer' },
      pro: { device: process.env.PRO_DEVICE_UDID!, appId: 'app.ideony.pro' },
    });
  });

  afterEach(async (context) => {
    if (context.task.result?.state === 'fail') await orch.collectArtifacts();
    await api.cleanupTenant(tenantId);
  });

  it('consumer books, pro accepts, both complete', async () => {
    // Phase 1: consumer searches and books
    await orch.runFlow('consumer', 'consumer-search-and-book.yaml', {
      TENANT_ID: tenantId,
      EXPECTED_PRO: 'test-p-1',
    });

    const bookingId = await api.getLatestBookingId(tenantId);

    // Phase 2: assert booking exists in PENDING_ACCEPTANCE + pro got push
    await waitForBookingStatus(api, bookingId, 'PENDING_ACCEPTANCE', { timeoutMs: 5000 });
    await waitForWsEvent('booking:new', { bookingId, proId: 'test-p-1' });

    // Phase 3: pro accepts via their app
    await orch.runFlow('pro', 'pro-accept.yaml', { TENANT_ID: tenantId, BOOKING_ID: bookingId });
    await waitForBookingStatus(api, bookingId, 'ACCEPTED');

    // Phase 4: simulate clock advance to scheduled time
    await api.advanceClock(3600); // 1hr forward

    // Phase 5: pro marks arrived, then completed
    await orch.runFlow('pro', 'pro-arrive-complete.yaml', { BOOKING_ID: bookingId });
    await waitForBookingStatus(api, bookingId, 'COMPLETED');

    // Phase 6: consumer sees receipt + review prompt
    await orch.runFlow('consumer', 'consumer-verify-completion.yaml', { BOOKING_ID: bookingId });

    // Phase 7: final state asserts
    const booking = await api.getBooking(bookingId);
    expect(booking.status).toBe('COMPLETED');
    expect(booking.paymentCapturedAt).toBeDefined();
  });
});
```

Key properties:
- Each `orch.runFlow` spawns a fresh `maestro test --device <udid> --env TENANT_ID=…` and awaits the exit code.
- Between Maestro invocations, Node-level assertions + BE API calls — fast, deterministic.
- `waitForWsEvent` subscribes to the Redis `events:bookings:*` channel, resolves on match or timeout.
- Tenant isolation means scenarios can run in parallel on separate emulator pairs.
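A sketch of the first property: a hypothetical `runFlow` wrapping `child_process.spawn` around the Maestro CLI, with the binary injectable so the shape can be exercised without Maestro installed.

```typescript
import { spawn } from 'node:child_process';

// Sketch of the orchestrator's per-flow spawn: one `maestro test` invocation
// per flow, tenant/booking IDs passed as --env vars, resolved with the exit
// code. The binary is parameterized (assumption) purely so this is testable.
function runFlow(
  flowFile: string,
  env: Record<string, string>,
  opts: { bin?: string; device?: string } = {},
): Promise<number> {
  const args = [
    'test',
    ...(opts.device ? ['--device', opts.device] : []),
    ...Object.entries(env).flatMap(([k, v]) => ['--env', `${k}=${v}`]),
    flowFile,
  ];
  return new Promise((resolve, reject) => {
    const child = spawn(opts.bin ?? 'maestro', args, { stdio: 'inherit' });
    child.on('error', reject); // e.g. binary not found
    child.on('exit', (code) => resolve(code ?? 1));
  });
}

// Usage shape (hypothetical flow name and tenant ID):
// await runFlow('consumer-search-and-book.yaml', { TENANT_ID: 't-1' }, { device: 'emulator-5554' });
```

Resolving with the exit code (rather than rejecting on non-zero) lets the orchestrator decide whether a failed flow should abort the scenario or trigger artifact collection first.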
4.6 CI workflow
```yaml
# .github/workflows/e2e-multi-role.yml (shape only, not production code)
name: E2E Multi-role
on: [pull_request, workflow_dispatch]
jobs:
  orchestrated:
    runs-on: [self-hosted, arm64, macos]
    strategy:
      fail-fast: false
      matrix:
        scenario: [01-booking, 02-sos, 03-cancel, 04-chat, 05-credentials, 06-rating]
    timeout-minutes: 15
    steps:
      - uses: actions/checkout@v5
      - run: pnpm install --frozen-lockfile
      - run: pnpm docker:up # Postgres + Redis + MinIO + Mailpit
      - run: pnpm --filter @ideony/api migrate deploy
      - run: pnpm --filter @ideony/api seed:demo
      - run: pnpm --filter @ideony/api start:test-mode & # exposes /test/* endpoints
      - run: ./scripts/boot-emulator-pair.sh # 2 Android emulators
      - run: pnpm --filter @ideony/mobile build:test-apk
      - run: pnpm --filter @ideony/mobile install:test-apks
      - run: pnpm test:multi-role -- --scenario ${{ matrix.scenario }}
      - if: failure()
        uses: actions/upload-artifact@v4
        with:
          name: artifacts-${{ matrix.scenario }}
          path: test/e2e-multi-role/artifacts/
```

Runtime budget: each scenario 2-5 min. 6 scenarios × parallel workers (2 concurrent on the ARM64 runner w/ 4 emulators) = ~15 min wall time. Acceptable gate for PRs touching `apps/api/src/modules/{booking,sos,credentials}` or `apps/mobile`.
Flakiness mitigation:
- Retries: 2 retries per scenario in CI (but log + surface flake counts to Sentry)
- Placebo tests (Uber’s trick): duplicate run of a scenario with no code change → measure raw flake rate per scenario
- Screenshot + video + WS-transcript + BE-log bundle on failure, posted as PR comment
- Emulator snapshots per-run rather than per-session (avoid accumulated state)
4.7 Observability on failure
The failure triage loop is the make-or-break of multi-role testing. Design:
- Correlation ID per scenario run — `X-Test-Run-Id: <uuid>` header on every request, stamped onto every NestJS log line via `ClsModule`, threaded into Sentry scope.
- WS transcript — the orchestrator’s own socket.io-client logs every frame it sees, dumped to `artifacts/<run-id>/ws-transcript.ndjson`.
- Screenshots — Maestro `--screenshot-on-failure` + explicit `takeScreenshot` at key assertion points.
- Video — Maestro cloud video recording (free tier for up to N runs; self-host `screenrecord` as fallback).
- BE log bundle — on test failure, the orchestrator calls `GET /test/logs?run_id=<uuid>`, which returns structured logs from all BE services for that correlation ID.
- DB snapshot — on failure, `pg_dump` the test-tenant rows for post-mortem.
- Sentry replay — already configured for mobile; replay ID linked into the failure artifact bundle.
Failure comment shape on PR:
Scenario 02 (SOS dispatch) failed at step 4: waitForBookingStatus timeout after 10s — expected ACCEPTED, got DISPATCHING. Artifacts: WS transcript • Consumer screenshot • Pro screenshot • BE logs • DB snapshot • Sentry event
4.8 Rollout phases (refined)
Replaces the `project_test_coverage_next_steps.md` plan while retaining its Phase E numbering:
- E0 — Prerequisites (3 days)
  - `/test/scenarios/:name` endpoint (NestJS module, gated by `process.env.TEST_MODE === '1'` + shared secret header)
  - `ClockService` abstraction with an `advanceClock` test-only method
  - `test_tenant` column + middleware
  - Orchestrator scaffolding (`lib/orchestrator.ts`, `lib/test-api.ts`, `lib/sync.ts`)
- E1 — Single-user refresh (1 wk)
  - Refresh the existing 122 Maestro flows against current main (Sole palette + NativeWind + SOS gateway changes)
  - Add the onboarding wizard flow once PR ad665e05 lands
- E2 — WebSocket integration layer (3 days) — highest-ROI step
  - Multi-client `socket.io-client` integration tests for chat delivery, booking state broadcast, and SOS cascade dispatch
  - Runs in the same NestJS test harness as BE unit tests; no mobile runners
- E3 — Scenarios 1 (booking) + 5 (credentials) orchestrated (1 wk)
  - Exercises the full orchestrator without the most complex features (no GPS streaming)
- E4 — Scenario 2 (SOS full cycle) orchestrated (1 wk)
  - Most complex: `/test/geo-feed` endpoint for GPS injection, clock advance through the countdown, cascade-dispatch winner assertion
- E5 — Scenarios 3, 4, 6 + CI wiring + Lost Pixel refresh cadence (1 wk)
  - Completes the canonical 6
- E6 — Post-launch: synthetic canaries via Checkly (free tier) against prod — deferred until v1 is stable
Total: ~4 weeks of focused eng work. Monthly cost: €0 (Maestro CLI, Playwright, and Lost Pixel are all on free tiers). Optional €64/mo for Checkly after launch.
Section 5 — Open questions for brainstorming
Before locking the architecture, these questions need user input:
- Test tenancy vs ephemeral DB? The recommended pattern uses a `test_tenant` column + middleware on a shared DB. Alternative: spin up a fresh Postgres via `docker compose` per CI job (heavier startup, cleaner isolation). Preference?
- GPS injection — backend API or device-level? Recommended: a backend `/test/geo-feed` that bypasses mobile GPS entirely; the pro app receives lat/long via the dispatch WS event. Alternative: Android mock-location + Xcode simulator GPS. The backend approach is faster and simpler but tests less of the real GPS→app path. Acceptable tradeoff?
- Clock mocking scope? The `ClockService.now()` abstraction touches booking reminders, the SOS countdown, and reservation expiry. Retrofitting the whole codebase is ~1 day's work. Alternative: a scenario-specific fake clock via request header — hackier, but it isolates the blast radius. Which?
- Cofounder-authored tests? TestRigor's plain-English DSL would let non-technical cofounders write tests. Worth the €300+/mo and lock-in cost, or keep tests eng-only?
- Parallel scenario execution — 2 or 4 emulator pairs? The ARM64 Mac mini runner handles 2 pairs comfortably; 4 pairs would need a second runner (~€50/mo Hetzner). Worth it for the wall-time reduction, or accept the 15-min gate?
- Visual regression on multi-role scenarios? Lost Pixel integrated at `assertVisible` points inside flows, or a dedicated non-multi-role visual pass? The latter is cheaper and avoids flake amplification but misses visual regressions that only appear during cross-user flows (e.g. chat bubble rendering).
- Maestro Cloud vs local orchestration? Maestro Cloud offers a parallel device farm for $$. Self-hosting saves money but caps parallelism. Start self-hosted and escalate if scenario 6+ times out?
- Fail-fast policy? On the first scenario failure in a matrix run, abort the remaining scenarios to save CI minutes, or always run all 6 for the full diagnostic picture? Recommend the latter despite the cost (6 × 2 min = 12 emulator-minutes is cheap; the diagnostic value is high).
- Real Stripe/Clerk/Novu in tests? Stripe test mode, the Clerk dev tenant, and the Novu dev env all have rate limits; hitting them per-PR could throttle CI. Recommend: mock the Clerk JWT (fixed signing key in test mode), use Stripe test cards, and use the Novu dev env with retries. Is that acceptable, or should we mock all three externals entirely?
- Multi-role post-launch — extend to 3 actors (consumer + pro + admin approving credentials live)? Scenario 5 as currently scoped has admin approval happening offline. A 3-actor version would validate the full admin-review flow but needs a 3rd emulator or a Playwright browser for the admin web UI. Scope into Phase E or defer?
Sources cited
SOTA companies:
- Uber Blog, “Shifting E2E Testing Left at Uber” (2024-08-22) — https://www.uber.com/blog/shifting-e2e-testing-left/
- DPE Summit 2024 talk (Gradle) — https://www.youtube.com/watch?v=Lat87fBTShQ
- Signadot, “Shifting E2E Left on Microservices” (2024-10-21) — https://medium.com/@signadot/shifting-end-to-end-testing-left-on-microservices-e3c6b0adf2cb
- DoorDash Engineering, “Moving e2e testing into production with multi-tenancy” (2022-03-03) — https://doordash.engineering/2022/03/03/moving-e2e-testing-into-production-with-multi-tenancy-for-increased-speed-and-reliability/
- DoorDash Delivery Simulator docs — https://developer.doordash.com/docs/drive_classic/how_to/use_delivery_simulator/
- Bolt Labs, “Simulating cities for a better ride-hailing experience” (2019-08-15) — https://medium.com/bolt-labs/simulating-cities-for-a-better-ride-hailing-experience-at-bolt-f97af9190ada
- Glovo Engineering, “How to Simulate a Global Delivery Platform” (2021-12-30) — https://medium.com/glovo-engineering/how-to-simulate-a-global-delivery-platform-7aa5fa475d88
- Airbnb Tech Blog, “Better Android Testing at Airbnb — Part 2: Screenshot Testing” — https://medium.com/airbnb-engineering/better-android-testing-at-airbnb-a77ac9531cab
- Happo.io docs + testimonials — https://happo.io/
- LambdaTest, “Balancing the Test Pyramid the AWS way” — https://lambdatest.com/blog/balancing-the-test-pyramid-the-aws-way
Tooling:
- Maestro docs — https://maestro.dev/, https://docs.maestro.dev/
- Maestro multi-device issue #2957 — https://github.com/mobile-dev-inc/Maestro/issues/2957
- Maestro parallel testing insights (2026-03-11) — https://maestro.dev/insights/parallel-testing-android-ios
- Playwright BrowserContext docs — https://playwright.dev/docs/api/class-browsercontext
- Playwright isolation guide — https://playwright.dev/docs/browser-contexts
- OneUptime, “How to Configure Playwright Parallel Execution” (2026-01-28) — https://oneuptime.com/blog/post/2026-01-28-playwright-parallel-execution/view
- Detox docs — https://wix.github.io/Detox/
- Appium Pro, “Testing Real-Time User Interaction Using Multiple Simultaneous Appium Sessions” — https://appiumpro.com/editions/118-testing-real-time-user-interaction-using-multiple-simultaneous-appium-sessions
- Socket.IO testing docs (v4) — https://socket.io/docs/v4/testing/
- tests.ws, “WebSocket Testing Best Practices” (2026-02-13) — https://tests.ws/testing/websocket-testing-best-practices
AI tool pricing + reviews:
- QAWolf pricing post (2026-01-19) — https://www.qawolf.com/blog/qa-wolf-is-reinventing-qa-pricing
- Bug0, “QA Wolf Pricing” — https://bug0.com/knowledge-base/qa-wolf-pricing
- Momentic Enterprise page — https://momentic.ai/enterprise
- Momentic review — https://bug0.com/knowledge-base/momentic-review
- Momentic on trendingaitools — https://www.trendingaitools.com/ai-tools/momentic-ai-2/
- TestRigor FAQ — https://testrigor.com/faq
- TestRigor pricing (Stackpick) — https://stackpick.net/pricing/testrigor/
- Applitools pricing — https://applitools.com/pricing
- Reflect.run pricing — https://reflect.run/pricing/
- Checkly alternatives 2026 — https://scanlyapp.com/blog/checkly-alternatives-2026
- Visual Sentinel, “Visual Regression Testing Setup 2026” — https://visualsentinel.com/blog/how-to-set-up-visual-regression-testing-2026
GPS simulation:
- BrowserStack Appium GPS docs — https://www.browserstack.com/docs/app-automate/appium/test-real-user-conditions/simulate-gps-location
- Detox `setLocation` PR #3479 — https://github.com/wix/Detox/pull/3479
- iMobie GPS simulation dev guide — https://www.imobie.com/location-change/simulating-gps.htm
End of report.