Multi-role E2E test architecture for Ideony — deep research

Date: 2026-04-19
Status: research-only, no implementation
Feeds into: future spec for Phase E multi-role test harness
Authors: Claude (Opus 4.7, deep-research mode)


Ideony is a two-sided marketplace with real-time coordination semantics (SOS dispatch + booking state machine + chat + GPS tracking). Single-actor E2E tools (Maestro, Playwright, Detox) cover individual flows well but leave the hardest coordination bugs — race conditions on concurrent UPDATE booking SET status=…, WebSocket delivery guarantees, cascade dispatch winner-selection, GPS staleness, push-notification ordering — completely untested.

After auditing how Uber, Airbnb, DoorDash, Bolt, Glovo, and Amazon handle this problem and comparing every commercial AI E2E tool against an OSS-custom approach, this report recommends:

  1. Build, don’t buy. No commercial tool (QAWolf, Momentic, TestRigor, Reflect, Mabl) can orchestrate two Expo clients + shared BE state + WebSocket assertions + GPS streaming on CI at a price Ideony can absorb pre-revenue. They are all built for single-user web flows; multi-role is either absent or emulated via brittle multi-tab hacks.
  2. Pattern: Maestro per device + Playwright for web + thin Node.js orchestrator over Postgres + Redis as coordination substrate. This mirrors DoorDash’s multi-tenancy model, Uber’s Composable Testing Framework (CTF), and Bolt’s simulator — at SaaS-startup scale.
  3. State model is the hard part, not the runner. Every SOTA company converged on a centralized, introspectable test state (Uber CTF trip object, DoorDash “DoorTest” tenant, Bolt city simulator ticks). Ideony must expose a /test/state endpoint and deterministic seed+freeze controls before any harness is worth writing.
  4. Layer the testing pyramid. 60% of “multi-role” bugs can be caught with WebSocket integration tests using two socket.io-client instances against a real NestJS gateway — no mobile runner, 100× faster, deterministic. Reserve full orchestrated multi-device runs for the 6 canonical scenarios.

Budget estimate: 3-4 weeks to build, ~€0/month ongoing (self-hosted ARM64 runner already exists). Alternative (QAWolf managed): ~€8-15k/yr + still can’t do multi-device orchestration without custom glue. OSS route wins on both cost and capability.


1.1 Uber — Composable Testing Framework (CTF) + BITS

Source of truth: Uber Blog, “Shifting E2E Testing Left at Uber” (2024-08-22) + DPE Summit 2024 talk by Daniel Tsui & Quess Liu + Signadot deep-dive.

Stack

  • Composable Testing Framework (CTF) — internal code-level DSL, JVM-based, where every test action is a pure function over a centralized trip state object. E.g. rider.requestRide(), driver.accept(), driver.arrive() — each mutates a shared TripState snapshot.
  • BITS (Backend Integration Testing Strategy) — Cadence-orchestrated workflow engine that provisions ephemeral sandboxes that route test traffic through production services via OpenTelemetry baggage-based context propagation. Tests run against real prod downstreams, but side-effects are scoped to test-account tenancy.
  • Cadence (now Temporal) for sandbox lifecycle
  • Jaeger trace indices to measure endpoint coverage per test

Multi-actor pattern: The forward dispatch example in their blog is the canonical reference: a driver receives a new pickup before completing the prior drop. CTF represents both actors as operations on shared trip state. The framework ensures serialization — driver.accept() cannot run until rider.requestRide() emits the matching event.

Shared state / fixtures: Test tenancy propagated via OpenTelemetry baggage → all Kafka topics, DB writes, RPC calls carry tenant=test-acc-123 → real production routing infrastructure reroutes those to sandboxed data stores. No DB seeding per se; production is the seed.

WebSocket / real-time: Covered by context propagation — dispatch events carry the tenant tag, dispatch service routes them to the sandbox WS gateway. Test asserts on trip state after event, not on raw WS frames.

GPS simulation: Internal “rider/driver simulator” — proprietary, not open-sourced. Drives location updates via mocked GPS feed into the dispatch engine.

Push notifications: Captured via sandboxed push bus (APNS/FCM test accounts); assertion looks for push payload delivery to the test tenant’s notification queue.

Visual regression: Not primary — Uber’s testing is backend-heavy. Mobile visual testing uses standard Espresso/XCUITest snapshots.

Cost / CI time: Several thousand E2E tests run pre-merge. Individual test runtime amortized via Cadence parallelism. Pass rate 90%+ per attempt, boosted to 99.9% via retry. Reduced incidents/1000 diffs by 71% in 2023.

Takeaway for Ideony: Uber’s pattern is a microservice-scale solution; Ideony is a monolith. What transfers: the state-first test DSL idea. Write tests as actions-on-booking-state, not as UI click sequences. Ideony’s equivalent of TripState is BookingState + SOSDispatchState.
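The state-first DSL idea can be made concrete with a small sketch. Everything below — the BookingState shape, the consumer/pro action names — is a hypothetical illustration of the pattern, not existing Ideony code:

```typescript
// Sketch of a CTF-style, state-first test DSL for Ideony.
// Every action is a pure function over a shared BookingState snapshot, so the
// harness can serialize actions and assert on state between steps.
type BookingStatus = 'CREATED' | 'PENDING_ACCEPTANCE' | 'ACCEPTED' | 'COMPLETED';

interface BookingState {
  status: BookingStatus;
  consumerId?: string;
  proId?: string;
  events: string[]; // audit trail of actions, useful when a run fails
}

// Each actor method validates the current status before transitioning,
// mirroring how CTF forbids driver.accept() before rider.requestRide().
const consumer = {
  requestBooking(s: BookingState, consumerId: string): BookingState {
    if (s.status !== 'CREATED') throw new Error(`cannot request from ${s.status}`);
    return { ...s, status: 'PENDING_ACCEPTANCE', consumerId, events: [...s.events, 'consumer.requestBooking'] };
  },
};

const pro = {
  accept(s: BookingState, proId: string): BookingState {
    if (s.status !== 'PENDING_ACCEPTANCE') throw new Error(`cannot accept from ${s.status}`);
    return { ...s, status: 'ACCEPTED', proId, events: [...s.events, 'pro.accept'] };
  },
  complete(s: BookingState): BookingState {
    if (s.status !== 'ACCEPTED') throw new Error(`cannot complete from ${s.status}`);
    return { ...s, status: 'COMPLETED', events: [...s.events, 'pro.complete'] };
  },
};

// A test is then a pipeline of actions over one state object:
let state: BookingState = { status: 'CREATED', events: [] };
state = consumer.requestBooking(state, 'test-c-1');
state = pro.accept(state, 'test-p-1');
state = pro.complete(state);
```

Out-of-order actions fail fast with a readable error, which is exactly the serialization guarantee the Uber blog describes.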

1.2 Airbnb — Cypress + Happo + Ruby integration + one-shard-per-PR

Sources: Happo.io case studies, Better Android Testing at Airbnb - Part 2 by Eli Hart, multiple Airbnb Tech Blog posts.

Stack

  • Cypress for web E2E
  • Happo (external SaaS) for visual regression — screenshots uploaded async, keyed by git SHA, PR-blocking diff review
  • Espresso + Happo for Android; approach: every mock variant × every screen → Happo diff
  • Internal “Happyhost” simulator (mentioned in various talks) — simulated backend for deterministic E2E seeding
  • Jest for unit

Multi-actor pattern: Airbnb has a lighter multi-actor requirement than Uber — host-vs-guest interactions are mostly async messaging, not real-time. Their pattern uses two Cypress browser contexts (via cy.session() with different cookie jars) + DB fixture seeding. One test file, two personas, synchronization via polling a shared mock inbox.

Shared state: Database fixtures — Rails+Postgres seed snapshots keyed to test scenarios. Each test resets to a known state.

WebSocket / real-time: Minimal — messaging is push/email primarily. Where WS exists (inbox updates), tested at the service level via RSpec + stub adapters.

Visual regression: Happo is the crown jewel. Async bitmap upload + cross-browser (Chrome, Firefox, Safari, Edge, iOS Safari) parallelization. Eli Hart’s Android article describes how every mock variant × every screen = a screenshot. PR build posts diff comment; reviewers approve visual changes as part of code review.

CI time / cost: Happo priced competitively against Percy (cheaper, per their marketing; exact pricing gated behind sales). Cypress shards per PR on internal infra.

Takeaway for Ideony: Ideony already uses Lost Pixel for visual regression, which is the same pattern at lower cost. The two-Cypress-contexts pattern doesn’t transfer directly (Ideony is mobile-first), but the persona-seed + polling-sync approach does transfer to two-device Maestro orchestration.

1.3 DoorDash — Multi-tenancy in production + “DoorTest” guardrails

Source: Moving e2e testing into production with multi-tenancy + Drive Delivery Simulator docs.

Stack

  • Kotlin backend + Kafka
  • Multi-tenant gRPC interceptor model — every request carries a tenant header; test tenants live alongside prod tenants
  • Internal UI tool + gRPC service for devs to spawn test users (consumer, dasher), simulate geolocation, create test stores, assign test orders
  • Delivery Simulator — public dev portal tool that advances an order through states (Created → Dasher Confirmed → Arrived at Pickup → Picked Up → Arrived at Dropoff → Delivered) without dispatching real dashers

Multi-actor pattern: Test consumer places order at a test store (real stores don’t accept test orders; test stores don’t accept real orders — enforced by tenant guardrail). Test dasher picks up. All state transitions exercise the real dispatcher, payment, and notification services, but side-effects (money, SMS, actual deliveries) are routed to no-op sinks.

Shared state: No seeding — production is the environment. Guardrails enforce isolation. Test scenarios are reliably reproducible because the tooling programmatically creates test users, addresses, stores.

WebSocket / real-time: Driver app receives dispatch via the same real-time channel; tenancy header determines routing. Tests assert on state-machine transitions, not frame-level WS.

GPS: Test user address simulation baked into the internal tool — set lat/long per user, dispatcher uses it for matching.

Takeaway for Ideony: Multi-tenancy in production is overkill for Ideony’s stage. But the DoorTest tooling idea is directly applicable: build a /test/scenarios admin API that seeds a named scenario ("SOS_BURST_PIPE_ROME") → pre-created consumer + 3 pros within 10km + fake booking state. Multi-role test just calls POST /test/scenarios/sos_burst_pipe_rome then drives two actors through the flow.
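One building block of such a seeding endpoint is plain geometry: placing N test pros at deterministic points within a radius of the consumer. A sketch (function names are hypothetical; the distance check is the standard haversine formula):

```typescript
// Sketch: place test pros within `radiusKm` of a consumer location, as a
// /test/scenarios seeding endpoint might. All names are illustrative only.
interface LatLng { lat: number; lng: number }

const EARTH_RADIUS_KM = 6371;

// Standard haversine great-circle distance in km.
export function distanceKm(a: LatLng, b: LatLng): number {
  const toRad = (d: number) => (d * Math.PI) / 180;
  const dLat = toRad(b.lat - a.lat);
  const dLng = toRad(b.lng - a.lng);
  const h =
    Math.sin(dLat / 2) ** 2 +
    Math.cos(toRad(a.lat)) * Math.cos(toRad(b.lat)) * Math.sin(dLng / 2) ** 2;
  return 2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(h));
}

// Deterministic placement: pros at fixed angles and fixed fractions of the
// radius, so re-seeding the same scenario always yields identical coordinates.
export function placePros(center: LatLng, count: number, radiusKm: number): LatLng[] {
  return Array.from({ length: count }, (_, i) => {
    const angle = (2 * Math.PI * i) / count;
    const r = radiusKm * ((i + 1) / (count + 1)); // spread across the radius
    const dLatDeg = (r / EARTH_RADIUS_KM) * (180 / Math.PI);
    const dLngDeg = dLatDeg / Math.cos((center.lat * Math.PI) / 180);
    return { lat: center.lat + dLatDeg * Math.sin(angle), lng: center.lng + dLngDeg * Math.cos(angle) };
  });
}
```

Determinism matters more than realism here: the same scenario name must always produce the same pros at the same coordinates, or cascade-dispatch assertions become flaky.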

1.4 Bolt — City simulator + SimPy-style event loop

Source: Simulating cities for a better ride-hailing experience at Bolt (Aug 2019, still referenced).

Stack

  • Python-based city simulator
  • OSRM for map routing
  • Trained ETA + matching probability models
  • Event-driven tick loop (simulated heartbeat)
  • Real backend matching/pricing/dispatching algorithms plugged in as black boxes

Multi-actor pattern: Not strictly E2E — Bolt’s simulator is algorithm validation more than product validation. Generates N virtual riders + N drivers per city, runs them through a day’s worth of events, measures aggregate KPIs (avg pickup time, cancellation rate, utilization). Before deploying a new matching algorithm, run it through the sim; compare to baseline.

Shared state: Agents (riders, drivers) are in-process Python objects. Historical order data seeds realistic arrival rates.

Real-time assertions: None — simulator is offline/batch. Real-time E2E happens separately via Maestro + Appium for UI flows.

Takeaway for Ideony: Pre-launch Ideony doesn’t have historical data yet — a city simulator is premature. But the event-driven tick loop is the right abstraction once volume exists. Post-v1, a simulator of 50 pros + 20 consumers in Rome could stress-test the SOS cascade dispatch algorithm. Mark as Phase F+ (not MVP, not Phase E).

1.5 Glovo — Jarvis + SimPy simulator + “teswiz”-style multi-platform E2E

Sources: How to Simulate a Global Delivery Platform, Glovo career pages citing Appium + Kotlin.

Essentially identical to Bolt: Python SimPy simulator for algorithm research, separate Appium + Kotlin E2E pipeline for product validation. Jarvis (their dispatcher) is tested in the sim using event-based simulation with trained probability models.

Takeaway for Ideony: Same as Bolt — defer the algorithm simulator. Adopt the Glovo hiring-page stack (Appium + OOP test framework + Grafana/Sentry logs) as a floor target, but substitute Maestro for Appium (equivalent for our purposes, much lighter setup).

1.6 Amazon — Selenium Grid + synthetic canaries + infrequent multi-actor

Sources: Balancing the Test Pyramid the AWS way + AWS Builder Library (public re:Invent talks).

Stack

  • Selenium Grid + Appium for UI
  • Synthetic canaries (CloudWatch Synthetics or internal equivalent) running in production every minute
  • Heavy reliance on pre-prod canary deploys — code is tested in prod with tiny blast radius rather than in staging

Multi-actor pattern: Rare. Seller-buyer interactions are typically async (seller lists, buyer buys days later). When needed, separate test envs per persona with pre-loaded state rather than live orchestration.

Takeaway for Ideony: Synthetic canaries in prod post-launch = excellent. Run the booking-completion scenario every 5 minutes against prod (w/ tenant=synthetic) → monitor success rate. This is a post-v1 concern, not Phase E.

Distilled across all six:

| Dimension | Dominant pattern |
| --- | --- |
| Runner | Whatever single-user runner matches the platform (Cypress for web, Espresso/XCUITest for native, Appium/Maestro for React Native) |
| Orchestration | Node.js, Kotlin, or Python glue script in CI |
| Shared state | Centralized mutable test-state object exposed via API, or multi-tenant production routing |
| Real-time sync | Poll /test/state, or subscribe to a tap on the event bus (Redis pub/sub, Kafka topic) |
| GPS | Simulator injects lat/long through the real dispatch service; never through mobile OS APIs at the E2E layer |
| Visual | Async screenshot SaaS (Happo, Percy, Chromatic, Applitools, Lost Pixel) keyed by SHA |
| Fault injection | Separate chaos runs, not interleaved with functional E2E |

Nobody uses AI-authored tools at SOTA scale for multi-actor flows. The AI tools market is aimed at QA-team-light startups that want to replace manual testers, not at engineering-driven orgs that already write test code.


Section 2 — AI-powered E2E tools comparison

QAWolf

Sources: qawolf.com, Bug0 review, QAWolf pay-per-test pricing post (2026-01-19).

  • Model: Managed service. Humans + AI author Playwright/Appium tests. Charged per test per month (undisclosed per-test price, anchored around “roughly half an in-house QA engineer”). Example cited: 400-800 tests total per mid-size app.
  • Multi-user: Playwright supports it via multiple browser contexts; QAWolf can author such tests. Real-time multi-mobile orchestration: no explicit support.
  • Mobile: Web + iOS + Android via Appium; no React Native specialization.
  • CI: GitHub Actions supported. Webhook on deploy triggers suite.
  • SOTA verdict: Would Airbnb use it? No — Airbnb has its own test engineers. Would a 5-person YC startup use it? Yes, to offload QA entirely.
  • Ideony fit: 4/10. Cost unknown but likely €500-2000/mo given “half a QA engineer” anchor. Covers single-user flows well. For SOS dispatch / real-time / GPS / WebSocket assertions — would need custom Playwright code they author on your behalf, losing the “managed” value prop. Lock-in risk: tests run on their infra.

Momentic

Sources: momentic.ai/enterprise, trendingaitools.com Momentic review, Bug0 Momentic review.

  • Pricing (2026-04): Starter free (50 runs/mo, 1 env); Pro $99/mo (1000 runs/mo); Business custom. $15M Series A by Standard Capital (recent).
  • Multi-user: Web only. No multi-mobile orchestration. Chrome extension records flows → AI generates self-healing selectors. Cannot coordinate two simultaneous contexts natively; user has to script that themselves.
  • Mobile: Web-only as of this writing (flagged in reviews; “mobile support pending”).
  • CI: CI/CD webhooks, GitHub Actions. 99.99% uptime SLA, SOC2 Type 2.
  • Verdict for Ideony: 3/10. Disqualified by the web-only limitation. Ideony is Expo-first; web is a secondary target. Even for the web preview, multi-role coordination is DIY.

TestRigor

Sources: testRigor FAQ, stackpick pricing.

  • Pricing: Free for public tests; $300/mo private (lowest paid tier). Pricing scales with parallelization units, not tests-per-month (favorable for large suites, unfavorable for small ones).
  • Multi-user: Explicitly advertised. FAQ mentions “multiple users to interact via email, sms, or instant messages.” Plain-English test DSL: "login as user1", "send message 'hello' to user2", "verify user2 receives 'hello'".
  • Mobile: Web + mobile via internal runners. Native iOS/Android less mature than web.
  • CI: REST API trigger from GitHub Actions.
  • Verdict for Ideony: 5/10. Plain-English spec is attractive for non-technical cofounder to author tests. But €300+/mo + black-box infrastructure + unknown maturity of multi-role feature for our specific (Expo + WebSocket + GPS) shape makes it risky. Would require a paid pilot before committing. Lock-in is high (tests are in testRigor’s DSL, not portable).

Applitools

Sources: applitools.com/pricing, Visual Sentinel 2026 comparison.

  • Pricing: Free 100 checkpoints/mo; paid starts ~$899/mo (cited for 1000 checkpoints). No public mid-tier pricing.
  • Not a functional E2E tool. Adds visual + Ultrafast Grid (cross-browser) on top of Playwright/Cypress/Appium.
  • Multi-user: N/A — it’s a visual layer, orthogonal to orchestration.
  • Verdict for Ideony: 2/10 as a multi-role solution (not its purpose). 6/10 as a potential Lost Pixel upgrade if visual complexity grows. Ideony already uses Lost Pixel (free, self-hosted); no reason to switch pre-revenue.

Mabl

Sources: saascounter.net pricing survey, vendor comparisons.

  • Pricing: ~$250-450/mo starting (low-code, AI-assisted). Not self-serve — requires sales call.
  • Multi-user: Limited; primarily single-flow low-code.
  • Mobile: Web + limited mobile.
  • Verdict: 3/10. Enterprise-oriented, pricing opaque, no natural fit for mobile-first real-time product.

Reflect

Source: reflect.run/pricing.

  • Pricing: Team $225/mo (web+API, 500 credits/mo), Premium contact sales. Mobile testing is a paid add-on, “private mobile” tier is Enterprise.
  • Multi-user: Documented only as separate tests; no orchestration primitives.
  • Verdict: 3/10. Similar profile to Mabl — web-first, mobile bolted on.

Checkly

Source: scanlyapp 2026 Checkly alternatives.

  • Pricing: Free (10k API runs/mo); Team $64/mo (100k API runs, 12k browser runs).
  • Positioning: Monitoring-as-code, not E2E test authoring. Playwright scripts in git, run on schedule against prod.
  • Multi-user: Playwright’s multi-context capability available since tests are raw Playwright code.
  • Verdict for Ideony: 7/10 as a post-launch synthetic monitoring solution — run the booking happy-path every 5 minutes in prod, alert on failure. Not a multi-role development tool but a complement.

Autify

Public info: web + native app via AI-assisted test recorder, $99/mo starter, $450/mo Pro. No meaningful multi-role support beyond single-user flows.

Verdict: 3/10.

  • Functionize: Enterprise AI testing, no public pricing, no multi-role emphasis. 2/10.
  • DevCycle: Feature flag platform — not an E2E tool. Mis-categorized in the brief. Could be used for test-gating (flag-on-for-test-tenant) but that’s orthogonal. N/A.

| Tool | Ideony score | Reasoning |
| --- | --- | --- |
| Checkly | 7 | Great for post-launch synthetic monitoring; not a primary E2E tool |
| TestRigor | 5 | Explicit multi-user support, but opaque infra + $300/mo + lock-in |
| QAWolf | 4 | Managed quality, but cost unknown and multi-mobile not their strength |
| Applitools | 3 | Orthogonal (visual only); already have Lost Pixel |
| Mabl / Reflect / Autify | 3 | Web-biased, mobile bolted on |
| Momentic | 3 | Web-only; mobile promised but absent |
| Functionize | 2 | Enterprise-only, no pricing transparency |

None are >7. No AI tool offers a compelling out-of-the-box solution for Ideony’s specific cocktail: Expo (iOS+Android+Web from same codebase) + Socket.IO gateway + GPS streaming + two-actor coordination + Italian-first locale.

Section 3 — Ideony requirements and build-vs-buy

Scope for the multi-role harness:


  • 2 mobile clients (consumer + pro) ± optional 3rd (admin web)
  • Real BE (NestJS + Postgres + Redis + Socket.IO), only Stripe/Clerk/Novu in test mode
  • Deterministic state seeds via existing pnpm seed:demo + future /test/scenarios/:name
  • 6 canonical scenarios (defined in project_multi_role_e2e.md)
  • IT + EN locales
  • Expo (iOS/Android/Web) from single codebase
  • GPS streaming (SOS dispatch)
  • WebSocket delivery assertions (chat, dispatch, tracking)
  • Push-notification timing (Novu → Resend/Twilio/Expo Push)
| Dimension | OSS custom orchestrator | TestRigor | QAWolf managed | Mabl / Autify |
| --- | --- | --- | --- | --- |
| Initial eng cost | 3-4 wk senior eng | 1 wk integration + learning curve | 1 wk onboarding | 1 wk onboarding |
| Monthly cost | €0 (self-hosted ARM64 runner exists) | €300-900 | €500-2000 est. | €250-450 |
| Annual cost (yr 1) | ~€0 + time sunk | ~€5-11k | ~€6-24k | ~€3-5.5k |
| Multi-mobile orchestration | Full control | Limited | Limited | Web-biased |
| GPS simulation | Full control via backend API | Blocked (no runner access) | Via custom Playwright | Via custom Playwright |
| WebSocket assertion | Direct socket.io-client | Via their DSL (uncertain fidelity) | Via custom Playwright | Limited |
| Debuggability | 100% (our code, our logs) | Dashboard-based | Dashboard-based | Dashboard-based |
| Lock-in | None | High (their DSL) | Medium (Playwright artifacts exported) | High |
| Adding a 7th scenario | ~1 day | ~1 day in DSL | Request new test (billed) | ~1 day |
| Skill on team | TS, Node, Socket.IO — already have | Plain English — anyone | None needed | None needed |
| Pre-revenue startup fit | Excellent | Marginal | Poor | Poor |

Recommendation: build, don’t buy. Rationale:

  1. Ideony already has the infrastructure: Maestro license (free tier, ARM64 runner), Playwright installed, NestJS backend with /test/* endpoints possible, Socket.IO gateway already instrumented.
  2. The 3-week eng cost front-loads cleanly into Phase E; AI tools shift cost to monthly burn without solving the multi-role problem fully.
  3. Debuggability matters more than anything when multi-role tests fail at 2am — our code, our logs, our traces beat a black-box SaaS dashboard every time.
  4. Exit cost: if in 2 years we want to migrate to QAWolf, our Maestro flows and Playwright specs are portable (they’re the same artifacts QAWolf would author).

Caveat: If cofounders want cofounder-level QA (i.e. non-technical PM writes tests), TestRigor becomes more attractive. But that’s a product-org decision, not a technical one.

Section 4 — Recommended architecture


| Layer | Tool | Version | Role |
| --- | --- | --- | --- |
| Mobile single-user runner | Maestro | 2.4.0+ (CLI) | Per-device YAML flows; 122 flows exist |
| Web single-user runner | Playwright | @playwright/test 1.48+ | Web E2E; browser contexts for secondary multi-user |
| Native iOS-specific fallback | Detox | 20.x | Only where Maestro can’t reach (rare) |
| Orchestrator | Node.js + TypeScript + Vitest | Node 22 LTS + Vitest 2.x | Multi-role coordinator |
| Event bus (test) | Redis pub/sub | 8.6 (existing) | Sync primitive between runners |
| WebSocket client (integration layer) | socket.io-client | 4.8+ | Direct multi-client assertions |
| State control | NestJS /test/scenarios module | New; gated by NODE_ENV=test + auth header | Deterministic seeds |
| Visual | Lost Pixel | (existing) | No change |
| GPS simulation | Custom BE /test/geo-feed endpoint | New | Inject GPS into dispatch service, not into mobile OS |
| CI | GitHub Actions + self-hosted ARM64 runner | existing | Orchestrates all of the above |
| Observability | Sentry (BE) + run artifacts (screenshots, videos, WS transcripts) | existing Sentry + new artifact bundler | Failure triage |

Layered pyramid (bottom = fastest, broadest; top = slowest, narrowest):

+---------------------------+
| Multi-device Maestro | <-- 6 canonical scenarios
| orchestrated (CI only) | ~2-5 min each
+---------------------------+
+-----------------------------+
| Playwright 2-context web | <-- subset for web
| multi-role (local+CI) | ~30s each
+-----------------------------+
+-----------------------------------+
| Multi-client socket.io-client | <-- bulk of coordination
| integration tests (no UI) | ~1-3s each
+-----------------------------------+
+---------------------------------------+
| Single-service NestJS unit + e2e | <-- existing 300+ BE tests
+---------------------------------------+

Crucially: most multi-role bugs can be caught at the “multi-client socket.io-client” layer without a single mobile runner. Two Node processes, real BE, real Postgres, real Redis — assert on WS frames + DB state. Only the top two layers need the heavy orchestration.
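The core primitive of that layer is "await a frame matching these fields, or time out". A sketch, using Node's EventEmitter as a transport-agnostic stand-in for socket.io-client (whose .on()/.off() have the same shape); the real tests would attach the same listener to two socket.io-client instances:

```typescript
import { EventEmitter } from 'node:events';

// Await a single event whose payload contains the expected fields, with a
// timeout — the core assertion of a two-client WS test. An EventEmitter
// stands in here for a socket.io-client connection.
export function waitForEvent<T extends Record<string, unknown>>(
  socket: EventEmitter,
  event: string,
  match: Partial<T>,
  timeoutMs = 2000,
): Promise<T> {
  return new Promise((resolve, reject) => {
    const timer = setTimeout(() => {
      socket.off(event, onEvent);
      reject(new Error(`timed out waiting for ${event} matching ${JSON.stringify(match)}`));
    }, timeoutMs);
    function onEvent(payload: T) {
      const ok = Object.entries(match).every(([k, v]) => payload[k] === v);
      if (!ok) return; // keep listening: another client's event may arrive first
      clearTimeout(timer);
      socket.off(event, onEvent);
      resolve(payload);
    }
    socket.on(event, onEvent);
  });
}
```

A cascade-dispatch test then becomes: connect three pro clients, seed the scenario, fire the SOS, and assert which client's waitForEvent resolves and which time out.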

Three seed mechanisms, used in combination:

  1. pnpm seed:demo — persistent demo data (existing). Baseline: 20 pros, 5 consumers, Rome-Milan-Turin, realistic distribution. Run once per CI job startup.
  2. POST /test/scenarios/:name — scenario-specific superposition. E.g. sos_burst_pipe_rome creates consumer test-c-1, pro test-p-1/2/3 each at specific lat/long within 10km of consumer, pending booking in state CREATED. Idempotent; wipes prior test-scoped rows for that name.
  3. POST /test/cleanup — end-of-test sweep. Deletes rows tagged with test_tenant=<uuid>; each multi-role test generates a fresh tenant ID at start.

DB isolation: all test rows get a test_tenant column (nullable for prod data). Middleware adds WHERE test_tenant = $1 OR test_tenant IS NULL to read queries during tests. Hard guardrail: deletes under test tenant require matching tenant header.
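The two guardrails reduce to a pair of predicates, shown here in pure form (ORM wiring omitted; column and parameter names follow the convention above):

```typescript
// Sketch of the test-tenant guardrails as pure predicates. Reads see prod
// rows (test_tenant IS NULL) plus the caller's own test rows; deletes
// require an exact tenant match. The real version lives in query middleware.
interface Row { id: string; test_tenant: string | null }

export function visibleToTenant(row: Row, tenant: string | null): boolean {
  // Prod traffic (tenant === null) must never see test rows.
  if (tenant === null) return row.test_tenant === null;
  return row.test_tenant === null || row.test_tenant === tenant;
}

export function deletableByTenant(row: Row, tenant: string | null): boolean {
  // Hard guardrail: only a matching test tenant may delete its own rows;
  // prod rows are untouchable through the test cleanup path.
  return tenant !== null && row.test_tenant === tenant;
}
```

Keeping the rules this small makes them easy to unit-test exhaustively before trusting them to protect production data.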

Clock control: POST /test/clock/advance?seconds=300 — advances a centralized mocked clock (ClockService.now() abstraction already partially exists for booking reminders). Avoids real 5-minute waits in SOS countdown tests.
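The clock abstraction itself can be tiny. A sketch (the ClockService name comes from this report; the offset-based shape is an assumption):

```typescript
// Sketch of a test-controllable clock. In test mode, /test/clock/advance
// would call advance(); production code only ever calls now().
export class ClockService {
  private offsetMs = 0;

  now(): Date {
    return new Date(Date.now() + this.offsetMs);
  }

  // Test-only: jump the clock forward, e.g. past an SOS countdown window.
  advance(seconds: number): void {
    this.offsetMs += seconds * 1000;
  }
}
```

The retrofit cost is replacing direct new Date() / Date.now() calls in time-sensitive modules with clock.now(), after which a 5-minute countdown test takes milliseconds.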

Deterministic randomness: seeded RNG exposed via CryptoService. Test mode sets a known seed via POST /test/seed.
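Any small seeded PRNG works for this; mulberry32 is a common, dependency-free choice (a sketch — the actual CryptoService internals are not specified here):

```typescript
// mulberry32: tiny deterministic PRNG. Same seed => same sequence, so a
// test-mode POST /test/seed can make randomized pro-matching reproducible.
export function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) | 0;
    let t = Math.imul(a ^ (a >>> 15), 1 | a);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}
```

In test mode the service returns this generator for a fixed seed; in production it falls through to real crypto randomness.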

Real-time synchronization between actors is the critical design choice. Options considered:

  1. Polling REST — test A polls GET /test/state/booking/:id until status=ACCEPTED. Simple, reliable, slow (1s granularity). Chosen for happy-path sync.
  2. Redis pub/sub tap — test harness subscribes to events:bookings:* channel, awaits specific message. Low-latency (<50ms), deterministic. Chosen for timing-sensitive sync (GPS tracking, chat delivery).
  3. Shared WS bus — test spawns its own socket.io-client, subscribes to dispatch events. Chosen for assertions about what pros receive (multi-client WS test layer).

Rule: polling for state convergence; pub/sub tap for event-fired assertions.
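The polling side of that rule reduces to one generic helper; a sketch (helpers like waitForBookingStatus referenced elsewhere in this report would be thin wrappers over something like this):

```typescript
// Generic convergence poller: re-read state until a predicate holds or a
// timeout expires. waitForBookingStatus(api, id, 'ACCEPTED') would pass
// () => api.getBooking(id) as `read` and a status check as `predicate`.
export async function waitFor<T>(
  read: () => Promise<T>,
  predicate: (value: T) => boolean,
  opts: { timeoutMs: number; intervalMs?: number },
): Promise<T> {
  const interval = opts.intervalMs ?? 1000; // 1s granularity, per the rule above
  const deadline = Date.now() + opts.timeoutMs;
  for (;;) {
    const last = await read();
    if (predicate(last)) return last;
    if (Date.now() >= deadline) {
      // Include the last observed value: essential for 2am failure triage.
      throw new Error(`waitFor: timed out after ${opts.timeoutMs}ms (last value: ${JSON.stringify(last)})`);
    }
    await new Promise((r) => setTimeout(r, interval));
  }
}
```

Embedding the last observed value in the timeout error is what turns "expected ACCEPTED" into "expected ACCEPTED, got DISPATCHING" in the PR failure comment.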

4.5 Sample scenario walkthrough — Scenario 1

“Consumer search → pro match → book → pay → complete”

File layout:

test/e2e-multi-role/
  scenarios/
    01-booking-full-cycle/
      scenario.spec.ts       <-- orchestrator entry
      consumer.flow.yaml     <-- Maestro flow for consumer app
      pro.flow.yaml          <-- Maestro flow for pro app
      assertions.ts          <-- shared DB + WS assertions
      fixtures/
        scenario-seed.json   <-- extra seed overrides
  lib/
    orchestrator.ts          <-- spawns + coordinates Maestro processes
    test-api.ts              <-- wraps /test/* endpoints
    ws-tap.ts                <-- Redis + socket.io listener helpers
    sync.ts                  <-- waitForState, waitForEvent helpers
    artifacts.ts             <-- collects screenshots, logs, ws-transcripts on failure

Shape of scenario.spec.ts (illustrative, not runnable):

import { describe, it, beforeAll, afterEach, expect } from 'vitest';
import { Orchestrator } from '../../lib/orchestrator';
import { TestApi } from '../../lib/test-api';
import { waitForBookingStatus, waitForWsEvent } from '../../lib/sync';

describe('Scenario 01: Booking full cycle', () => {
  let orch: Orchestrator;
  let api: TestApi;
  let tenantId: string;

  beforeAll(async () => {
    api = new TestApi();
    tenantId = await api.createTenant();
    await api.seedScenario('booking_full_cycle_rome', tenantId);
    orch = new Orchestrator({
      consumer: { device: process.env.CONSUMER_DEVICE_UDID!, appId: 'app.ideony.consumer' },
      pro: { device: process.env.PRO_DEVICE_UDID!, appId: 'app.ideony.pro' },
    });
  });

  afterEach(async (ctx) => {
    if (ctx.task.result?.state === 'fail') await orch.collectArtifacts();
    await api.cleanupTenant(tenantId);
  });

  it('consumer books, pro accepts, both complete', async () => {
    // Phase 1: consumer searches and books
    await orch.runFlow('consumer', 'consumer-search-and-book.yaml', {
      TENANT_ID: tenantId,
      EXPECTED_PRO: 'test-p-1',
    });
    const bookingId = await api.getLatestBookingId(tenantId);

    // Phase 2: assert booking exists in PENDING_ACCEPTANCE + pro got push
    await waitForBookingStatus(api, bookingId, 'PENDING_ACCEPTANCE', { timeoutMs: 5000 });
    await waitForWsEvent('booking:new', { bookingId, proId: 'test-p-1' });

    // Phase 3: pro accepts via their app
    await orch.runFlow('pro', 'pro-accept.yaml', { TENANT_ID: tenantId, BOOKING_ID: bookingId });
    await waitForBookingStatus(api, bookingId, 'ACCEPTED');

    // Phase 4: simulate clock advance to scheduled time
    await api.advanceClock(3600); // 1 hr forward

    // Phase 5: pro marks arrived, then completed
    await orch.runFlow('pro', 'pro-arrive-complete.yaml', { BOOKING_ID: bookingId });
    await waitForBookingStatus(api, bookingId, 'COMPLETED');

    // Phase 6: consumer sees receipt + review prompt
    await orch.runFlow('consumer', 'consumer-verify-completion.yaml', { BOOKING_ID: bookingId });

    // Phase 7: final state asserts
    const booking = await api.getBooking(bookingId);
    expect(booking.status).toBe('COMPLETED');
    expect(booking.paymentCapturedAt).toBeDefined();
  });
});

Key properties:

  • Each orch.runFlow spawns a fresh maestro test --device <udid> --env TENANT_ID=… and awaits exit code.
  • Between Maestro invocations, Node-level assertions + BE API calls — fast, deterministic.
  • waitForWsEvent subscribes to Redis events:bookings:* channel, resolves on match or timeout.
  • Tenant isolation means scenarios can run in parallel on separate emulator pairs.
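The runFlow core is a thin promise wrapper over child_process.spawn. A sketch (the maestro CLI invocation follows the bullet above; everything else here is illustrative):

```typescript
import { spawn } from 'node:child_process';

// Spawn one CLI flow run (e.g. `maestro test --device <udid> flow.yaml`)
// and resolve with its exit code. Env vars carry TENANT_ID etc. into the flow.
export function runProcess(
  cmd: string,
  args: string[],
  env: Record<string, string> = {},
): Promise<number> {
  return new Promise((resolve, reject) => {
    const child = spawn(cmd, args, {
      env: { ...process.env, ...env },
      stdio: 'inherit', // stream runner output straight into the CI log
    });
    child.on('error', reject); // e.g. binary not found
    child.on('close', (code) => resolve(code ?? -1));
  });
}

// orch.runFlow('consumer', 'flow.yaml', { TENANT_ID }) would then do roughly:
// const code = await runProcess('maestro', ['test', '--device', udid, 'flow.yaml'], { TENANT_ID });
// if (code !== 0) throw new Error('consumer flow failed');
```

Treating each flow as an opaque process with an exit code is what keeps the orchestrator runner-agnostic: a Playwright spec or Detox run plugs in the same way.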

# .github/workflows/e2e-multi-role.yml (shape only, not production code)
name: E2E Multi-role
on: [pull_request, workflow_dispatch]
jobs:
  orchestrated:
    runs-on: [self-hosted, arm64, macos]
    strategy:
      fail-fast: false
      matrix:
        scenario: [01-booking, 02-sos, 03-cancel, 04-chat, 05-credentials, 06-rating]
    timeout-minutes: 15
    steps:
      - uses: actions/checkout@v5
      - run: pnpm install --frozen-lockfile
      - run: pnpm docker:up # Postgres + Redis + MinIO + Mailpit
      - run: pnpm --filter @ideony/api migrate deploy
      - run: pnpm --filter @ideony/api seed:demo
      - run: pnpm --filter @ideony/api start:test-mode & # exposes /test/* endpoints
      - run: ./scripts/boot-emulator-pair.sh # 2 Android emulators
      - run: pnpm --filter @ideony/mobile build:test-apk
      - run: pnpm --filter @ideony/mobile install:test-apks
      - run: pnpm test:multi-role -- --scenario ${{ matrix.scenario }}
      - if: failure()
        uses: actions/upload-artifact@v4
        with:
          name: artifacts-${{ matrix.scenario }}
          path: test/e2e-multi-role/artifacts/

Runtime budget: each scenario 2-5 min. 6 scenarios × parallel workers (2 concurrent on ARM64 runner w/ 4 emulators) = ~15 min wall time. Acceptable gate for PRs touching apps/api/src/modules/{booking,sos,credentials} or apps/mobile.

Flakiness mitigation:

  • Retries: 2 retries per scenario in CI (but log + surface flake counts to Sentry)
  • Placebo tests (Uber’s trick): duplicate run of a scenario with no code change → measure raw flake rate per scenario
  • Screenshot + video + WS-transcript + BE-log bundle on failure, posted as PR comment
  • Emulator snapshots per-run rather than per-session (avoid accumulated state)

The failure triage loop is the make-or-break of multi-role testing. Design:

  1. Correlation ID per scenario run: an X-Test-Run-Id: <uuid> header on every request, stamped onto every NestJS log line via ClsModule, threaded into Sentry scope.
  2. WS transcript — orchestrator’s own socket.io-client logs every frame it sees, dumped to artifacts/<run-id>/ws-transcript.ndjson.
  3. Screenshots — Maestro --screenshot-on-failure + explicit takeScreenshot at key assertion points.
  4. Video — Maestro cloud video recording (free tier for up to N runs; self-host screenrecord as fallback).
  5. BE log bundle — on test failure, orchestrator calls GET /test/logs?run_id=<uuid> which returns structured logs from all BE services for that correlation ID.
  6. DB snapshot — on failure, pg_dump the test-tenant rows for post-mortem.
  7. Sentry replay — already configured for mobile; replay ID linked into failure artifact bundle.
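The WS transcript is only useful at 2am if it is trivially filterable. A sketch of the ndjson read side (the frame shape is an assumption):

```typescript
// Sketch: filter an ndjson WS transcript (one JSON frame per line) down to
// the frames relevant to a failing assertion. Frame shape is assumed.
interface Frame { ts: string; event: string; payload: Record<string, unknown> }

export function filterTranscript(
  ndjson: string,
  event: string,
  payloadMatch: Record<string, unknown> = {},
): Frame[] {
  return ndjson
    .split('\n')
    .filter((line) => line.trim().length > 0) // tolerate trailing newline
    .map((line) => JSON.parse(line) as Frame)
    .filter(
      (f) =>
        f.event === event &&
        Object.entries(payloadMatch).every(([k, v]) => f.payload[k] === v),
    );
}
```

ndjson (one JSON object per line) is deliberate: the same file greps cleanly in a terminal and parses cleanly in a triage script.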

Failure comment shape on PR:

Scenario 02 (SOS dispatch) failed at step 4: waitForBookingStatus timeout after 10s — expected ACCEPTED, got DISPATCHING. Artifacts: WS transcript · consumer screenshot · pro screenshot · BE logs · DB snapshot · Sentry event

Replaces the project_test_coverage_next_steps.md plan, retains its Phase E numbering:

  • E0 — Prerequisites (3 days)

    • /test/scenarios/:name endpoint (NestJS module, gated by process.env.TEST_MODE === '1' + shared secret header)
    • ClockService abstraction with advanceClock test-only method
    • test_tenant column + middleware
    • Orchestrator scaffolding (lib/orchestrator.ts, lib/test-api.ts, lib/sync.ts)
  • E1 — Single-user refresh (1 wk)

    • Refresh existing 122 Maestro flows against current main (Sole palette + NativeWind + SOS gateway changes)
    • Add onboarding wizard flow once PR ad665e05 lands
  • E2 — WebSocket integration layer (3 days) — highest ROI step

    • Multi-client socket.io-client integration tests for chat delivery, booking state broadcast, SOS cascade dispatch
    • Runs in same NestJS test harness as BE unit tests; no mobile runners
  • E3 — Scenario 1 (booking) + 5 (credentials) orchestrated (1 wk)

    • Exercises full orchestrator without the most complex features (no GPS streaming)
  • E4 — Scenario 2 (SOS full cycle) orchestrated (1 wk)

    • Most complex: /test/geo-feed endpoint for GPS injection, clock advance through countdown, cascade dispatch winner assertion
  • E5 — Scenarios 3, 4, 6 + CI wiring + Lost Pixel refresh cadence (1 wk)

    • Complete the canonical 6
  • E6 — Post-launch: synthetic canaries via Checkly (Free tier) against prod — deferred until v1 stable

Total: ~4 weeks of focused eng work. Monthly cost €0 (Maestro CLI + Playwright + Lost Pixel all free tiers). Optional €64/mo Checkly after launch.


Section 5 — Open questions for brainstorming

Before locking the architecture, these need user input:

  1. Test tenancy vs ephemeral DB? The recommended pattern uses test_tenant column + middleware on a shared DB. Alternative: spin up a fresh Postgres via docker compose per CI job (heavier startup, cleaner isolation). Preference?

  2. GPS injection — backend API or device-level? Recommended: backend /test/geo-feed that bypasses mobile GPS entirely, pro app receives lat/long via dispatch WS event. Alternative: Android mock-location + Xcode simulator GPS. Backend approach is faster/simpler but tests less of the real GPS→app path. Acceptable tradeoff?

  3. Clock mocking scope? ClockService.now() abstraction touches Booking reminders, SOS countdown, reservation expiry. Retrofitting the whole codebase is ~1 day’s work. Alternative: scenario-specific fake clock via request header. The latter is hackier but isolates blast radius. Which?

  4. Cofounder-authored tests? TestRigor’s plain-English DSL would let non-technical cofounders write tests. Worth the €300+/mo + lock-in cost? Or keep tests eng-only?

  5. Parallel scenario execution — 2 or 4 emulator pairs? ARM64 Mac mini runner handles 2 pairs comfortably. 4 pairs would need a second runner (~€50/mo Hetzner). Worth it for wall-time reduction, or accept 15-min gate?

  6. Visual regression on multi-role scenarios? Lost Pixel integrated at assertVisible points inside flows, OR a dedicated non-multi-role visual pass? The latter is cheaper and avoids flake amplification but misses visual regressions that only appear during cross-user flows (e.g. chat bubbles rendering).

  7. Maestro Cloud vs local orchestration? Maestro Cloud offers parallel device farm for $$. Self-hosted saves money but caps parallelism. Start self-hosted, escalate if scenario 6+ times out?

  8. Fail-fast policy? On first scenario failure in a matrix run, abort remaining scenarios to save CI minutes? Or always run all 6 for full diagnostic? Recommend the latter despite cost (6×2min = 12 emulator-minutes is cheap; diagnostic value is high).

  9. Real Stripe/Clerk/Novu in tests? Stripe test-mode + Clerk dev tenant + Novu dev env have rate limits. Hitting them per-PR could throttle CI. Recommend: mock Clerk JWT (fixed signing key in test mode), use Stripe test cards, Novu dev env with retries. Is that acceptable, or should we mock all three externals entirely?

  10. Multi-role post-launch — extend to 3-actor (consumer + pro + admin approving credentials live)? Scenario 5 as currently scoped has admin approval offline. A 3-actor version would validate the full admin-review flow but needs a 3rd emulator or a Playwright browser for the admin web UI. Scope in Phase E or defer?




End of report.