Multi-role E2E test architecture for Ideony — deep research

Date: 2026-04-19
Status: research-only, no implementation
Feeds into: future spec for Phase E multi-role test harness
Authors: Claude (Opus 4.7, deep-research mode)


Ideony is a two-sided marketplace with real-time coordination semantics (SOS dispatch + booking state machine + chat + GPS tracking). Single-actor E2E tools (Maestro, Playwright, Detox) cover individual flows well but leave the hardest coordination bugs — race conditions on concurrent UPDATE booking SET status=…, WebSocket delivery guarantees, cascade dispatch winner-selection, GPS staleness, push-notification ordering — completely untested.

After auditing how Uber, Airbnb, DoorDash, Bolt, Glovo, and Amazon handle this problem and comparing every commercial AI E2E tool against an OSS-custom approach, this report recommends:

  1. Build, don’t buy. No commercial tool (QAWolf, Momentic, TestRigor, Reflect, Mabl) can orchestrate two Expo clients + shared BE state + WebSocket assertions + GPS streaming on CI at a price Ideony can absorb pre-revenue. They are all built for single-user web flows; multi-role is either absent or emulated via brittle multi-tab hacks.
  2. Pattern: Maestro per device + Playwright for web + thin Node.js orchestrator over Postgres + Redis as coordination substrate. This mirrors DoorDash’s multi-tenancy model, Uber’s Composable Testing Framework (CTF), and Bolt’s simulator — at SaaS-startup scale.
  3. State model is the hard part, not the runner. Every SOTA company converged on a centralized, introspectable test state (Uber CTF trip object, DoorDash “DoorTest” tenant, Bolt city simulator ticks). Ideony must expose a /test/state endpoint and deterministic seed+freeze controls before any harness is worth writing.
  4. Layer the testing pyramid. 60% of “multi-role” bugs can be caught with WebSocket integration tests using two socket.io-client instances against a real NestJS gateway — no mobile runner, 100× faster, deterministic. Reserve full orchestrated multi-device runs for the 6 canonical scenarios.

Budget estimate: 3-4 weeks to build, ~€0/month ongoing (self-hosted ARM64 runner already exists). Alternative (QAWolf managed): ~€8-15k/yr + still can’t do multi-device orchestration without custom glue. OSS route wins on both cost and capability.


1.1 Uber — Composable Testing Framework (CTF) + BITS

Source of truth: Uber Blog, “Shifting E2E Testing Left at Uber” (2024-08-22) + DPE Summit 2024 talk by Daniel Tsui & Quess Liu + Signadot deep-dive.

Stack

  • Composable Testing Framework (CTF) — internal code-level DSL, JVM-based, where every test action is a pure function over a centralized trip state object. E.g. rider.requestRide(), driver.accept(), driver.arrive() — each mutates a shared TripState snapshot.
  • BITS (Backend Integration Testing Strategy) — Cadence-orchestrated workflow engine that provisions ephemeral sandboxes that route test traffic through production services via OpenTelemetry baggage-based context propagation. Tests run against real prod downstreams, but side-effects are scoped to test-account tenancy.
  • Cadence (now Temporal) for sandbox lifecycle
  • Jaeger trace indices to measure endpoint coverage per test

Multi-actor pattern: The forward dispatch example in their blog is the canonical reference: a driver receives a new pickup before completing the prior drop. CTF represents both actors as operations on shared trip state. The framework ensures serialization — driver.accept() cannot run until rider.requestRide() emits the matching event.

Shared state / fixtures: Test tenancy propagated via OpenTelemetry baggage → all Kafka topics, DB writes, RPC calls carry tenant=test-acc-123 → real production routing infrastructure reroutes those to sandboxed data stores. No DB seeding per se; production is the seed.

WebSocket / real-time: Covered by context propagation — dispatch events carry the tenant tag, dispatch service routes them to the sandbox WS gateway. Test asserts on trip state after event, not on raw WS frames.

GPS simulation: Internal “rider/driver simulator” — proprietary, not open-sourced. Drives location updates via mocked GPS feed into the dispatch engine.

Push notifications: Captured via sandboxed push bus (APNS/FCM test accounts); assertion looks for push payload delivery to the test tenant’s notification queue.

Visual regression: Not primary — Uber’s testing is backend-heavy. Mobile visual testing uses standard Espresso/XCUITest snapshots.

Cost / CI time: Several thousand E2E tests run pre-merge. Individual test runtime amortized via Cadence parallelism. Pass rate 90%+ per attempt, boosted to 99.9% via retry. Reduced incidents/1000 diffs by 71% in 2023.

Takeaway for Ideony: Uber’s pattern is a microservice-scale solution; Ideony is a monolith. What transfers: the state-first test DSL idea. Write tests as actions-on-booking-state, not as UI click sequences. Ideony’s equivalent of TripState is BookingState + SOSDispatchState.
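The state-first DSL idea can be made concrete with a small sketch. Everything below — the BookingState shape, the consumer/pro action names — is a hypothetical illustration of the pattern, not existing Ideony code:

```typescript
// Sketch of a CTF-style, state-first test DSL for Ideony.
// Every action is a pure function over a shared BookingState snapshot, so the
// harness can serialize actions and assert on state between steps.
type BookingStatus = 'CREATED' | 'PENDING_ACCEPTANCE' | 'ACCEPTED' | 'COMPLETED';

interface BookingState {
  status: BookingStatus;
  consumerId?: string;
  proId?: string;
  events: string[]; // audit trail of actions, useful when a run fails
}

// Each actor method validates the current status before transitioning,
// mirroring how CTF forbids driver.accept() before rider.requestRide().
const consumer = {
  requestBooking(s: BookingState, consumerId: string): BookingState {
    if (s.status !== 'CREATED') throw new Error(`cannot request from ${s.status}`);
    return { ...s, status: 'PENDING_ACCEPTANCE', consumerId, events: [...s.events, 'consumer.requestBooking'] };
  },
};

const pro = {
  accept(s: BookingState, proId: string): BookingState {
    if (s.status !== 'PENDING_ACCEPTANCE') throw new Error(`cannot accept from ${s.status}`);
    return { ...s, status: 'ACCEPTED', proId, events: [...s.events, 'pro.accept'] };
  },
  complete(s: BookingState): BookingState {
    if (s.status !== 'ACCEPTED') throw new Error(`cannot complete from ${s.status}`);
    return { ...s, status: 'COMPLETED', events: [...s.events, 'pro.complete'] };
  },
};

// A test is then a pipeline of actions over one state object:
let state: BookingState = { status: 'CREATED', events: [] };
state = consumer.requestBooking(state, 'test-c-1');
state = pro.accept(state, 'test-p-1');
state = pro.complete(state);
```

Out-of-order actions fail fast with a readable error, which is exactly the serialization guarantee the Uber blog describes.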

1.2 Airbnb — Cypress + Happo + Ruby integration + one-shard-per-PR

Sources: Happo.io case studies, Better Android Testing at Airbnb - Part 2 by Eli Hart, multiple Airbnb Tech Blog posts.

Stack

  • Cypress for web E2E
  • Happo (external SaaS) for visual regression — screenshots uploaded async, keyed by git SHA, PR-blocking diff review
  • Espresso + Happo for Android; approach: every mock variant × every screen → Happo diff
  • Internal “Happyhost” simulator (mentioned in various talks) — simulated backend for deterministic E2E seeding
  • Jest for unit

Multi-actor pattern: Airbnb has a lighter multi-actor requirement than Uber — host-vs-guest interactions are mostly async messaging, not real-time. Their pattern uses two Cypress browser contexts (via cy.session() with different cookie jars) + DB fixture seeding. One test file, two personas, synchronization via polling a shared mock inbox.

Shared state: Database fixtures — Rails+Postgres seed snapshots keyed to test scenarios. Each test resets to a known state.

WebSocket / real-time: Minimal — messaging is push/email primarily. Where WS exists (inbox updates), tested at the service level via RSpec + stub adapters.

Visual regression: Happo is the crown jewel. Async bitmap upload + cross-browser (Chrome, Firefox, Safari, Edge, iOS Safari) parallelization. Eli Hart’s Android article describes how every mock variant × every screen = a screenshot. PR build posts diff comment; reviewers approve visual changes as part of code review.

CI time / cost: Happo priced competitively against Percy (cheaper, per their marketing; exact pricing gated behind sales). Cypress shards per PR on internal infra.

Takeaway for Ideony: Ideony already uses Lost Pixel for visual regression, which is the same pattern at lower cost. The two-Cypress-contexts pattern doesn’t transfer directly (Ideony is mobile-first), but the persona-seed + polling-sync approach does transfer to two-device Maestro orchestration.

1.3 DoorDash — Multi-tenancy in production + “DoorTest” guardrails

Source: Moving e2e testing into production with multi-tenancy + Drive Delivery Simulator docs.

Stack

  • Kotlin backend + Kafka
  • Multi-tenant gRPC interceptor model — every request carries a tenant header; test tenants live alongside prod tenants
  • Internal UI tool + gRPC service for devs to spawn test users (consumer, dasher), simulate geolocation, create test stores, assign test orders
  • Delivery Simulator — public dev portal tool that advances an order through states (Created → Dasher Confirmed → Arrived at Pickup → Picked Up → Arrived at Dropoff → Delivered) without dispatching real dashers

Multi-actor pattern: Test consumer places order at a test store (real stores don’t accept test orders; test stores don’t accept real orders — enforced by tenant guardrail). Test dasher picks up. All state transitions exercise the real dispatcher, payment, and notification services, but side-effects (money, SMS, actual deliveries) are routed to no-op sinks.

Shared state: No seeding — production is the environment. Guardrails enforce isolation. Test scenarios are reliably reproducible because the tooling programmatically creates test users, addresses, stores.

WebSocket / real-time: Driver app receives dispatch via the same real-time channel; tenancy header determines routing. Tests assert on state-machine transitions, not frame-level WS.

GPS: Test user address simulation baked into the internal tool — set lat/long per user, dispatcher uses it for matching.

Takeaway for Ideony: Multi-tenancy in production is overkill for Ideony’s stage. But the DoorTest tooling idea is directly applicable: build a /test/scenarios admin API that seeds a named scenario ("SOS_BURST_PIPE_ROME") → pre-created consumer + 3 pros within 10km + fake booking state. Multi-role test just calls POST /test/scenarios/sos_burst_pipe_rome then drives two actors through the flow.
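One building block of such a seeding endpoint is plain geometry: placing N test pros at deterministic points within a radius of the consumer. A sketch (function names are hypothetical; the distance check is the standard haversine formula):

```typescript
// Sketch: place test pros within `radiusKm` of a consumer location, as a
// /test/scenarios seeding endpoint might. All names are illustrative only.
interface LatLng { lat: number; lng: number }

const EARTH_RADIUS_KM = 6371;

// Standard haversine great-circle distance in km.
export function distanceKm(a: LatLng, b: LatLng): number {
  const toRad = (d: number) => (d * Math.PI) / 180;
  const dLat = toRad(b.lat - a.lat);
  const dLng = toRad(b.lng - a.lng);
  const h =
    Math.sin(dLat / 2) ** 2 +
    Math.cos(toRad(a.lat)) * Math.cos(toRad(b.lat)) * Math.sin(dLng / 2) ** 2;
  return 2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(h));
}

// Deterministic placement: pros at fixed angles and fixed fractions of the
// radius, so re-seeding the same scenario always yields identical coordinates.
export function placePros(center: LatLng, count: number, radiusKm: number): LatLng[] {
  return Array.from({ length: count }, (_, i) => {
    const angle = (2 * Math.PI * i) / count;
    const r = radiusKm * ((i + 1) / (count + 1)); // spread across the radius
    const dLatDeg = (r / EARTH_RADIUS_KM) * (180 / Math.PI);
    const dLngDeg = dLatDeg / Math.cos((center.lat * Math.PI) / 180);
    return { lat: center.lat + dLatDeg * Math.sin(angle), lng: center.lng + dLngDeg * Math.cos(angle) };
  });
}
```

Determinism matters more than realism here: the same scenario name must always produce the same pros at the same coordinates, or cascade-dispatch assertions become flaky.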

1.4 Bolt — City simulator + SimPy-style event loop

Source: Simulating cities for a better ride-hailing experience at Bolt (Aug 2019, still referenced).

Stack

  • Python-based city simulator
  • OSRM for map routing
  • Trained ETA + matching probability models
  • Event-driven tick loop (simulated heartbeat)
  • Real backend matching/pricing/dispatching algorithms plugged in as black boxes

Multi-actor pattern: Not strictly E2E — Bolt’s simulator is algorithm validation more than product validation. Generates N virtual riders + N drivers per city, runs them through a day’s worth of events, measures aggregate KPIs (avg pickup time, cancellation rate, utilization). Before deploying a new matching algorithm, run it through the sim; compare to baseline.

Shared state: Agents (riders, drivers) are in-process Python objects. Historical order data seeds realistic arrival rates.

Real-time assertions: None — simulator is offline/batch. Real-time E2E happens separately via Maestro + Appium for UI flows.

Takeaway for Ideony: Pre-launch Ideony doesn’t have historical data yet — a city simulator is premature. But the event-driven tick loop is the right abstraction once volume exists. Post-v1, a simulator of 50 pros + 20 consumers in Rome could stress-test the SOS cascade dispatch algorithm. Mark as Phase F+ (not MVP, not Phase E).

1.5 Glovo — Jarvis + SimPy simulator + “teswiz”-style multi-platform E2E

Sources: How to Simulate a Global Delivery Platform, Glovo career pages citing Appium + Kotlin.

Essentially identical to Bolt: Python SimPy simulator for algorithm research, separate Appium + Kotlin E2E pipeline for product validation. Jarvis (their dispatcher) is tested in the sim using event-based simulation with trained probability models.

Takeaway for Ideony: Same as Bolt — defer the algorithm simulator. Adopt the Glovo hiring-page stack (Appium + OOP test framework + Grafana/Sentry logs) as a floor target, but substitute Maestro for Appium (equivalent for our purposes, much lighter setup).

1.6 Amazon — Selenium Grid + synthetic canaries + infrequent multi-actor

Sources: Balancing the Test Pyramid the AWS way + AWS Builder Library (public re:Invent talks).

Stack

  • Selenium Grid + Appium for UI
  • Synthetic canaries (CloudWatch Synthetics or internal equivalent) running in production every minute
  • Heavy reliance on pre-prod canary deploys — code is tested in prod with tiny blast radius rather than in staging

Multi-actor pattern: Rare. Seller-buyer interactions are typically async (seller lists, buyer buys days later). When needed, separate test envs per persona with pre-loaded state rather than live orchestration.

Takeaway for Ideony: Synthetic canaries in prod post-launch = excellent. Run the booking-completion scenario every 5 minutes against prod (w/ tenant=synthetic) → monitor success rate. This is a post-v1 concern, not Phase E.

Distilled across all six:

| Dimension | Dominant pattern |
| --- | --- |
| Runner | Whatever single-user runner matches the platform (Cypress for web, Espresso/XCUITest for native, Appium/Maestro for React Native) |
| Orchestration | Node.js, Kotlin, or Python glue script in CI |
| Shared state | Centralized mutable test-state object exposed via API, or multi-tenant production routing |
| Real-time sync | Poll /test/state, or subscribe to a tap on the event bus (Redis pub/sub, Kafka topic) |
| GPS | Simulator injects lat/long through the real dispatch service; never through mobile OS APIs at the E2E layer |
| Visual | Async screenshot SaaS (Happo, Percy, Chromatic, Applitools, Lost Pixel) keyed by SHA |
| Fault injection | Separate chaos runs, not interleaved with functional E2E |

Nobody uses AI-authored tools at SOTA scale for multi-actor flows. The AI tools market is aimed at QA-team-light startups that want to replace manual testers, not at engineering-driven orgs that already write test code.


Section 2 — AI-powered E2E tools comparison

QAWolf

Sources: qawolf.com, Bug0 review, QAWolf pay-per-test pricing post (2026-01-19).

  • Model: Managed service. Humans + AI author Playwright/Appium tests. Charged per test per month (undisclosed per-test price, anchored around “roughly half an in-house QA engineer”). Example cited: 400-800 tests total per mid-size app.
  • Multi-user: Playwright supports it via multiple browser contexts; QAWolf can author such tests. Real-time multi-mobile orchestration: no explicit support.
  • Mobile: Web + iOS + Android via Appium; no React Native specialization.
  • CI: GitHub Actions supported. Webhook on deploy triggers suite.
  • SOTA verdict: Would Airbnb use it? No — Airbnb has its own test engineers. Would a 5-person YC startup use it? Yes, to offload QA entirely.
  • Ideony fit: 4/10. Cost unknown but likely €500-2000/mo given “half a QA engineer” anchor. Covers single-user flows well. For SOS dispatch / real-time / GPS / WebSocket assertions — would need custom Playwright code they author on your behalf, losing the “managed” value prop. Lock-in risk: tests run on their infra.

Momentic

Sources: momentic.ai/enterprise, trendingaitools.com Momentic review, Bug0 Momentic review.

  • Pricing (2026-04): Starter free (50 runs/mo, 1 env); Pro $99/mo (1000 runs/mo); Business custom. $15M Series A by Standard Capital (recent).
  • Multi-user: Web only. No multi-mobile orchestration. Chrome extension records flows → AI generates self-healing selectors. Cannot coordinate two simultaneous contexts natively; user has to script that themselves.
  • Mobile: Web-only as of this writing (flagged in reviews; “mobile support pending”).
  • CI: CI/CD webhooks, GitHub Actions. 99.99% uptime SLA, SOC2 Type 2.
  • Verdict for Ideony: 3/10. Disqualified by the web-only limitation. Ideony is Expo-first; web is a secondary target. Even for the web preview, multi-role coordination is DIY.

TestRigor

Sources: testRigor FAQ, stackpick pricing.

  • Pricing: Free for public tests; $300/mo private (lowest paid tier). Pricing scales with parallelization units, not tests-per-month (favorable for large suites, unfavorable for small ones).
  • Multi-user: Explicitly advertised. FAQ mentions “multiple users to interact via email, sms, or instant messages.” Plain-English test DSL: "login as user1", "send message 'hello' to user2", "verify user2 receives 'hello'".
  • Mobile: Web + mobile via internal runners. Native iOS/Android less mature than web.
  • CI: REST API trigger from GitHub Actions.
  • Verdict for Ideony: 5/10. Plain-English spec is attractive for non-technical cofounder to author tests. But €300+/mo + black-box infrastructure + unknown maturity of multi-role feature for our specific (Expo + WebSocket + GPS) shape makes it risky. Would require a paid pilot before committing. Lock-in is high (tests are in testRigor’s DSL, not portable).

Applitools

Sources: applitools.com/pricing, Visual Sentinel 2026 comparison.

  • Pricing: Free 100 checkpoints/mo; paid starts ~$899/mo (cited for 1000 checkpoints). No public mid-tier pricing.
  • Not a functional E2E tool. Adds visual + Ultrafast Grid (cross-browser) on top of Playwright/Cypress/Appium.
  • Multi-user: N/A — it’s a visual layer, orthogonal to orchestration.
  • Verdict for Ideony: 2/10 as a multi-role solution (not its purpose). 6/10 as a potential Lost Pixel upgrade if visual complexity grows. Ideony already uses Lost Pixel (free, self-hosted); no reason to switch pre-revenue.

Mabl

Sources: saascounter.net pricing survey, vendor comparisons.

  • Pricing: ~$250-450/mo starting (low-code, AI-assisted). Not self-serve — requires sales call.
  • Multi-user: Limited; primarily single-flow low-code.
  • Mobile: Web + limited mobile.
  • Verdict: 3/10. Enterprise-oriented, pricing opaque, no natural fit for mobile-first real-time product.

Reflect

Source: reflect.run/pricing.

  • Pricing: Team $225/mo (web+API, 500 credits/mo), Premium contact sales. Mobile testing is a paid add-on, “private mobile” tier is Enterprise.
  • Multi-user: Documented only as separate tests; no orchestration primitives.
  • Verdict: 3/10. Similar profile to Mabl — web-first, mobile bolted on.

Checkly

Source: scanlyapp 2026 Checkly alternatives.

  • Pricing: Free (10k API runs/mo); Team $64/mo (100k API runs, 12k browser runs).
  • Positioning: Monitoring-as-code, not E2E test authoring. Playwright scripts in git, run on schedule against prod.
  • Multi-user: Playwright’s multi-context capability available since tests are raw Playwright code.
  • Verdict for Ideony: 7/10 as a post-launch synthetic monitoring solution — run the booking happy-path every 5 minutes in prod, alert on failure. Not a multi-role development tool but a complement.

Autify

Public info: web + native app via AI-assisted test recorder, $99/mo starter, $450/mo Pro. No meaningful multi-role support beyond single-user flows.

Verdict: 3/10.

  • Functionize: Enterprise AI testing, no public pricing, no multi-role emphasis. 2/10.
  • DevCycle: Feature flag platform — not an E2E tool. Mis-categorized in the brief. Could be used for test-gating (flag-on-for-test-tenant) but that’s orthogonal. N/A.

| Tool | Ideony score | Reasoning |
| --- | --- | --- |
| Checkly | 7 | Great for post-launch synthetic monitoring; not a primary E2E tool |
| TestRigor | 5 | Explicit multi-user support, but opaque infra + $300/mo + lock-in |
| QAWolf | 4 | Managed quality, but cost unknown and multi-mobile not their strength |
| Applitools | 3 | Orthogonal (visual only); already have Lost Pixel |
| Mabl / Reflect / Autify | 3 | Web-biased, mobile bolted on |
| Momentic | 3 | Web-only; mobile promised but absent |
| Functionize | 2 | Enterprise-only, no pricing transparency |

None are >7. No AI tool offers a compelling out-of-the-box solution for Ideony’s specific cocktail: Expo (iOS+Android+Web from same codebase) + Socket.IO gateway + GPS streaming + two-actor coordination + Italian-first locale.

Section 3 — Ideony requirements and build-vs-buy

Scope for the multi-role harness:


  • 2 mobile clients (consumer + pro) ± optional 3rd (admin web)
  • Real BE (NestJS + Postgres + Redis + Socket.IO), only Stripe/Clerk/Novu in test mode
  • Deterministic state seeds via existing pnpm seed:demo + future /test/scenarios/:name
  • 6 canonical scenarios (defined in project_multi_role_e2e.md)
  • IT + EN locales
  • Expo (iOS/Android/Web) from single codebase
  • GPS streaming (SOS dispatch)
  • WebSocket delivery assertions (chat, dispatch, tracking)
  • Push-notification timing (Novu → Resend/Twilio/Expo Push)
| Dimension | OSS custom orchestrator | TestRigor | QAWolf managed | Mabl / Autify |
| --- | --- | --- | --- | --- |
| Initial eng cost | 3-4 wk senior eng | 1 wk integration + learning curve | 1 wk onboarding | 1 wk onboarding |
| Monthly cost | €0 (self-hosted ARM64 runner exists) | €300-900 | €500-2000 est. | €250-450 |
| Annual cost (yr 1) | ~€0 + time sunk | ~€5-11k | ~€6-24k | ~€3-5.5k |
| Multi-mobile orchestration | Full control | Limited | Limited | Web-biased |
| GPS simulation | Full control via backend API | Blocked (no runner access) | Via custom Playwright | Via custom Playwright |
| WebSocket assertion | Direct socket.io-client | Via their DSL (uncertain fidelity) | Via custom Playwright | Limited |
| Debuggability | 100% (our code, our logs) | Dashboard-based | Dashboard-based | Dashboard-based |
| Lock-in | None | High (their DSL) | Medium (Playwright artifacts exported) | High |
| Adding a 7th scenario | ~1 day | ~1 day in DSL | Request new test (billed) | ~1 day |
| Skill on team | TS, Node, Socket.IO — already have | Plain English — anyone | None needed | None needed |
| Pre-revenue startup fit | Excellent | Marginal | Poor | Poor |

Recommendation: build, don’t buy. Rationale:

  1. Ideony already has the infrastructure: Maestro license (free tier, ARM64 runner), Playwright installed, NestJS backend with /test/* endpoints possible, Socket.IO gateway already instrumented.
  2. The 3-week eng cost front-loads cleanly into Phase E; AI tools shift cost to monthly burn without solving the multi-role problem fully.
  3. Debuggability matters more than anything when multi-role tests fail at 2am — our code, our logs, our traces beat a black-box SaaS dashboard every time.
  4. Exit cost: if in 2 years we want to migrate to QAWolf, our Maestro flows and Playwright specs are portable (they’re the same artifacts QAWolf would author).

Caveat: If cofounders want cofounder-level QA (i.e. non-technical PM writes tests), TestRigor becomes more attractive. But that’s a product-org decision, not a technical one.

Section 4 — Recommended architecture


| Layer | Tool | Version | Role |
| --- | --- | --- | --- |
| Mobile single-user runner | Maestro | 2.4.0+ (CLI) | Per-device YAML flows; 122 flows exist |
| Web single-user runner | Playwright | @playwright/test 1.48+ | Web E2E; browser contexts for secondary multi-user |
| Native iOS-specific fallback | Detox | 20.x | Only where Maestro can’t reach (rare) |
| Orchestrator | Node.js + TypeScript + Vitest | Node 22 LTS + Vitest 2.x | Multi-role coordinator |
| Event bus (test) | Redis pub/sub | 8.6 (existing) | Sync primitive between runners |
| WebSocket client (integration layer) | socket.io-client | 4.8+ | Direct multi-client assertions |
| State control | NestJS /test/scenarios module | New; gated by NODE_ENV=test + auth header | Deterministic seeds |
| Visual | Lost Pixel | (existing) | No change |
| GPS simulation | Custom BE /test/geo-feed endpoint | New | Inject GPS into dispatch service, not into mobile OS |
| CI | GitHub Actions + self-hosted ARM64 runner | existing | Orchestrates all of the above |
| Observability | Sentry (BE) + run artifacts (screenshots, videos, WS transcripts) | existing Sentry + new artifact bundler | Failure triage |

Layered pyramid (bottom = fastest, broadest; top = slowest, narrowest):

+---------------------------+
| Multi-device Maestro | <-- 6 canonical scenarios
| orchestrated (CI only) | ~2-5 min each
+---------------------------+
+-----------------------------+
| Playwright 2-context web | <-- subset for web
| multi-role (local+CI) | ~30s each
+-----------------------------+
+-----------------------------------+
| Multi-client socket.io-client | <-- bulk of coordination
| integration tests (no UI) | ~1-3s each
+-----------------------------------+
+---------------------------------------+
| Single-service NestJS unit + e2e | <-- existing 300+ BE tests
+---------------------------------------+

Crucially: most multi-role bugs can be caught at the “multi-client socket.io-client” layer without a single mobile runner. Two Node processes, real BE, real Postgres, real Redis — assert on WS frames + DB state. Only the top two layers need the heavy orchestration.
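The core primitive of that layer is "await a frame matching these fields, or time out". A sketch, using Node's EventEmitter as a transport-agnostic stand-in for socket.io-client (whose .on()/.off() have the same shape); the real tests would attach the same listener to two socket.io-client instances:

```typescript
import { EventEmitter } from 'node:events';

// Await a single event whose payload contains the expected fields, with a
// timeout — the core assertion of a two-client WS test. An EventEmitter
// stands in here for a socket.io-client connection.
export function waitForEvent<T extends Record<string, unknown>>(
  socket: EventEmitter,
  event: string,
  match: Partial<T>,
  timeoutMs = 2000,
): Promise<T> {
  return new Promise((resolve, reject) => {
    const timer = setTimeout(() => {
      socket.off(event, onEvent);
      reject(new Error(`timed out waiting for ${event} matching ${JSON.stringify(match)}`));
    }, timeoutMs);
    function onEvent(payload: T) {
      const ok = Object.entries(match).every(([k, v]) => payload[k] === v);
      if (!ok) return; // keep listening: another client's event may arrive first
      clearTimeout(timer);
      socket.off(event, onEvent);
      resolve(payload);
    }
    socket.on(event, onEvent);
  });
}
```

A cascade-dispatch test then becomes: connect three pro clients, seed the scenario, fire the SOS, and assert which client's waitForEvent resolves and which time out.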

Three seed mechanisms, used in combination:

  1. pnpm seed:demo — persistent demo data (existing). Baseline: 20 pros, 5 consumers, Rome-Milan-Turin, realistic distribution. Run once per CI job startup.
  2. POST /test/scenarios/:name — scenario-specific superposition. E.g. sos_burst_pipe_rome creates consumer test-c-1, pro test-p-1/2/3 each at specific lat/long within 10km of consumer, pending booking in state CREATED. Idempotent; wipes prior test-scoped rows for that name.
  3. POST /test/cleanup — end-of-test sweep. Deletes rows tagged with test_tenant=<uuid>; each multi-role test generates a fresh tenant ID at start.

DB isolation: all test rows get a test_tenant column (nullable for prod data). Middleware adds WHERE test_tenant = $1 OR test_tenant IS NULL to read queries during tests. Hard guardrail: deletes under test tenant require matching tenant header.
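The two guardrails reduce to a pair of predicates, shown here in pure form (ORM wiring omitted; column and parameter names follow the convention above):

```typescript
// Sketch of the test-tenant guardrails as pure predicates. Reads see prod
// rows (test_tenant IS NULL) plus the caller's own test rows; deletes
// require an exact tenant match. The real version lives in query middleware.
interface Row { id: string; test_tenant: string | null }

export function visibleToTenant(row: Row, tenant: string | null): boolean {
  // Prod traffic (tenant === null) must never see test rows.
  if (tenant === null) return row.test_tenant === null;
  return row.test_tenant === null || row.test_tenant === tenant;
}

export function deletableByTenant(row: Row, tenant: string | null): boolean {
  // Hard guardrail: only a matching test tenant may delete its own rows;
  // prod rows are untouchable through the test cleanup path.
  return tenant !== null && row.test_tenant === tenant;
}
```

Keeping the rules this small makes them easy to unit-test exhaustively before trusting them to protect production data.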

Clock control: POST /test/clock/advance?seconds=300 — advances a centralized mocked clock (ClockService.now() abstraction already partially exists for booking reminders). Avoids real 5-minute waits in SOS countdown tests.
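The clock abstraction itself can be tiny. A sketch (the ClockService name comes from this report; the offset-based shape is an assumption):

```typescript
// Sketch of a test-controllable clock. In test mode, /test/clock/advance
// would call advance(); production code only ever calls now().
export class ClockService {
  private offsetMs = 0;

  now(): Date {
    return new Date(Date.now() + this.offsetMs);
  }

  // Test-only: jump the clock forward, e.g. past an SOS countdown window.
  advance(seconds: number): void {
    this.offsetMs += seconds * 1000;
  }
}
```

The retrofit cost is replacing direct new Date() / Date.now() calls in time-sensitive modules with clock.now(), after which a 5-minute countdown test takes milliseconds.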

Deterministic randomness: seeded RNG exposed via CryptoService. Test mode sets a known seed via POST /test/seed.
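Any small seeded PRNG works for this; mulberry32 is a common, dependency-free choice (a sketch — the actual CryptoService internals are not specified here):

```typescript
// mulberry32: tiny deterministic PRNG. Same seed => same sequence, so a
// test-mode POST /test/seed can make randomized pro-matching reproducible.
export function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) | 0;
    let t = Math.imul(a ^ (a >>> 15), 1 | a);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}
```

In test mode the service returns this generator for a fixed seed; in production it falls through to real crypto randomness.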

Real-time synchronization between actors is the critical design choice. Options considered:

  1. Polling REST — test A polls GET /test/state/booking/:id until status=ACCEPTED. Simple, reliable, slow (1s granularity). Chosen for happy-path sync.
  2. Redis pub/sub tap — test harness subscribes to events:bookings:* channel, awaits specific message. Low-latency (<50ms), deterministic. Chosen for timing-sensitive sync (GPS tracking, chat delivery).
  3. Shared WS bus — test spawns its own socket.io-client, subscribes to dispatch events. Chosen for assertions about what pros receive (multi-client WS test layer).

Rule: polling for state convergence; pub/sub tap for event-fired assertions.
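The polling side of that rule reduces to one generic helper; a sketch (helpers like waitForBookingStatus referenced elsewhere in this report would be thin wrappers over something like this):

```typescript
// Generic convergence poller: re-read state until a predicate holds or a
// timeout expires. waitForBookingStatus(api, id, 'ACCEPTED') would pass
// () => api.getBooking(id) as `read` and a status check as `predicate`.
export async function waitFor<T>(
  read: () => Promise<T>,
  predicate: (value: T) => boolean,
  opts: { timeoutMs: number; intervalMs?: number },
): Promise<T> {
  const interval = opts.intervalMs ?? 1000; // 1s granularity, per the rule above
  const deadline = Date.now() + opts.timeoutMs;
  for (;;) {
    const last = await read();
    if (predicate(last)) return last;
    if (Date.now() >= deadline) {
      // Include the last observed value: essential for 2am failure triage.
      throw new Error(`waitFor: timed out after ${opts.timeoutMs}ms (last value: ${JSON.stringify(last)})`);
    }
    await new Promise((r) => setTimeout(r, interval));
  }
}
```

Embedding the last observed value in the timeout error is what turns "expected ACCEPTED" into "expected ACCEPTED, got DISPATCHING" in the PR failure comment.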

4.5 Sample scenario walkthrough — Scenario 1

“Consumer search → pro match → book → pay → complete”

File layout:

test/e2e-multi-role/
  scenarios/
    01-booking-full-cycle/
      scenario.spec.ts       <-- orchestrator entry
      consumer.flow.yaml     <-- Maestro flow for consumer app
      pro.flow.yaml          <-- Maestro flow for pro app
      assertions.ts          <-- shared DB + WS assertions
      fixtures/
        scenario-seed.json   <-- extra seed overrides
  lib/
    orchestrator.ts          <-- spawns + coordinates Maestro processes
    test-api.ts              <-- wraps /test/* endpoints
    ws-tap.ts                <-- Redis + socket.io listener helpers
    sync.ts                  <-- waitForState, waitForEvent helpers
    artifacts.ts             <-- collects screenshots, logs, ws-transcripts on failure

Shape of scenario.spec.ts (illustrative, not runnable):

import { describe, it, beforeAll, afterEach, expect } from 'vitest';
import { Orchestrator } from '../../lib/orchestrator';
import { TestApi } from '../../lib/test-api';
import { waitForBookingStatus, waitForWsEvent } from '../../lib/sync';

describe('Scenario 01: Booking full cycle', () => {
  let orch: Orchestrator;
  let api: TestApi;
  let tenantId: string;

  beforeAll(async () => {
    api = new TestApi();
    tenantId = await api.createTenant();
    await api.seedScenario('booking_full_cycle_rome', tenantId);
    orch = new Orchestrator({
      consumer: { device: process.env.CONSUMER_DEVICE_UDID!, appId: 'app.ideony.consumer' },
      pro: { device: process.env.PRO_DEVICE_UDID!, appId: 'app.ideony.pro' },
    });
  });

  afterEach(async (ctx) => {
    if (ctx.task.result?.state === 'fail') await orch.collectArtifacts();
    await api.cleanupTenant(tenantId);
  });

  it('consumer books, pro accepts, both complete', async () => {
    // Phase 1: consumer searches and books
    await orch.runFlow('consumer', 'consumer-search-and-book.yaml', {
      TENANT_ID: tenantId,
      EXPECTED_PRO: 'test-p-1',
    });
    const bookingId = await api.getLatestBookingId(tenantId);

    // Phase 2: assert booking exists in PENDING_ACCEPTANCE + pro got push
    await waitForBookingStatus(api, bookingId, 'PENDING_ACCEPTANCE', { timeoutMs: 5000 });
    await waitForWsEvent('booking:new', { bookingId, proId: 'test-p-1' });

    // Phase 3: pro accepts via their app
    await orch.runFlow('pro', 'pro-accept.yaml', { TENANT_ID: tenantId, BOOKING_ID: bookingId });
    await waitForBookingStatus(api, bookingId, 'ACCEPTED');

    // Phase 4: simulate clock advance to scheduled time
    await api.advanceClock(3600); // 1 hr forward

    // Phase 5: pro marks arrived, then completed
    await orch.runFlow('pro', 'pro-arrive-complete.yaml', { BOOKING_ID: bookingId });
    await waitForBookingStatus(api, bookingId, 'COMPLETED');

    // Phase 6: consumer sees receipt + review prompt
    await orch.runFlow('consumer', 'consumer-verify-completion.yaml', { BOOKING_ID: bookingId });

    // Phase 7: final state asserts
    const booking = await api.getBooking(bookingId);
    expect(booking.status).toBe('COMPLETED');
    expect(booking.paymentCapturedAt).toBeDefined();
  });
});

Key properties:

  • Each orch.runFlow spawns a fresh maestro test --device <udid> --env TENANT_ID=… and awaits exit code.
  • Between Maestro invocations, Node-level assertions + BE API calls — fast, deterministic.
  • waitForWsEvent subscribes to Redis events:bookings:* channel, resolves on match or timeout.
  • Tenant isolation means scenarios can run in parallel on separate emulator pairs.
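The runFlow core is a thin promise wrapper over child_process.spawn. A sketch (the maestro CLI invocation follows the bullet above; everything else here is illustrative):

```typescript
import { spawn } from 'node:child_process';

// Spawn one CLI flow run (e.g. `maestro test --device <udid> flow.yaml`)
// and resolve with its exit code. Env vars carry TENANT_ID etc. into the flow.
export function runProcess(
  cmd: string,
  args: string[],
  env: Record<string, string> = {},
): Promise<number> {
  return new Promise((resolve, reject) => {
    const child = spawn(cmd, args, {
      env: { ...process.env, ...env },
      stdio: 'inherit', // stream runner output straight into the CI log
    });
    child.on('error', reject); // e.g. binary not found
    child.on('close', (code) => resolve(code ?? -1));
  });
}

// orch.runFlow('consumer', 'flow.yaml', { TENANT_ID }) would then do roughly:
// const code = await runProcess('maestro', ['test', '--device', udid, 'flow.yaml'], { TENANT_ID });
// if (code !== 0) throw new Error('consumer flow failed');
```

Treating each flow as an opaque process with an exit code is what keeps the orchestrator runner-agnostic: a Playwright spec or Detox run plugs in the same way.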

# .github/workflows/e2e-multi-role.yml (shape only, not production code)
name: E2E Multi-role
on: [pull_request, workflow_dispatch]
jobs:
  orchestrated:
    runs-on: [self-hosted, arm64, macos]
    strategy:
      fail-fast: false
      matrix:
        scenario: [01-booking, 02-sos, 03-cancel, 04-chat, 05-credentials, 06-rating]
    timeout-minutes: 15
    steps:
      - uses: actions/checkout@v5
      - run: pnpm install --frozen-lockfile
      - run: pnpm docker:up # Postgres + Redis + MinIO + Mailpit
      - run: pnpm --filter @ideony/api migrate deploy
      - run: pnpm --filter @ideony/api seed:demo
      - run: pnpm --filter @ideony/api start:test-mode & # exposes /test/* endpoints
      - run: ./scripts/boot-emulator-pair.sh # 2 Android emulators
      - run: pnpm --filter @ideony/mobile build:test-apk
      - run: pnpm --filter @ideony/mobile install:test-apks
      - run: pnpm test:multi-role -- --scenario ${{ matrix.scenario }}
      - if: failure()
        uses: actions/upload-artifact@v4
        with:
          name: artifacts-${{ matrix.scenario }}
          path: test/e2e-multi-role/artifacts/

Runtime budget: each scenario 2-5 min. 6 scenarios × parallel workers (2 concurrent on ARM64 runner w/ 4 emulators) = ~15 min wall time. Acceptable gate for PRs touching apps/api/src/modules/{booking,sos,credentials} or apps/mobile.

Flakiness mitigation:

  • Retries: 2 retries per scenario in CI (but log + surface flake counts to Sentry)
  • Placebo tests (Uber’s trick): duplicate run of a scenario with no code change → measure raw flake rate per scenario
  • Screenshot + video + WS-transcript + BE-log bundle on failure, posted as PR comment
  • Emulator snapshots per-run rather than per-session (avoid accumulated state)

The failure triage loop is the make-or-break of multi-role testing. Design:

  1. Correlation ID per scenario run: an X-Test-Run-Id: <uuid> header on every request, stamped onto every NestJS log line via ClsModule, threaded into Sentry scope.
  2. WS transcript — orchestrator’s own socket.io-client logs every frame it sees, dumped to artifacts/<run-id>/ws-transcript.ndjson.
  3. Screenshots — Maestro --screenshot-on-failure + explicit takeScreenshot at key assertion points.
  4. Video — Maestro cloud video recording (free tier for up to N runs; self-host screenrecord as fallback).
  5. BE log bundle — on test failure, orchestrator calls GET /test/logs?run_id=<uuid> which returns structured logs from all BE services for that correlation ID.
  6. DB snapshot — on failure, pg_dump the test-tenant rows for post-mortem.
  7. Sentry replay — already configured for mobile; replay ID linked into failure artifact bundle.
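The WS transcript is only useful at 2am if it is trivially filterable. A sketch of the ndjson read side (the frame shape is an assumption):

```typescript
// Sketch: filter an ndjson WS transcript (one JSON frame per line) down to
// the frames relevant to a failing assertion. Frame shape is assumed.
interface Frame { ts: string; event: string; payload: Record<string, unknown> }

export function filterTranscript(
  ndjson: string,
  event: string,
  payloadMatch: Record<string, unknown> = {},
): Frame[] {
  return ndjson
    .split('\n')
    .filter((line) => line.trim().length > 0) // tolerate trailing newline
    .map((line) => JSON.parse(line) as Frame)
    .filter(
      (f) =>
        f.event === event &&
        Object.entries(payloadMatch).every(([k, v]) => f.payload[k] === v),
    );
}
```

ndjson (one JSON object per line) is deliberate: the same file greps cleanly in a terminal and parses cleanly in a triage script.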

Failure comment shape on PR:

Scenario 02 (SOS dispatch) failed at step 4: waitForBookingStatus timeout after 10s — expected ACCEPTED, got DISPATCHING. Artifacts: WS transcript · consumer screenshot · pro screenshot · BE logs · DB snapshot · Sentry event

Replaces the project_test_coverage_next_steps.md plan, retains its Phase E numbering:

  • E0 — Prerequisites (3 days)

    • /test/scenarios/:name endpoint (NestJS module, gated by process.env.TEST_MODE === '1' + shared secret header)
    • ClockService abstraction with advanceClock test-only method
    • test_tenant column + middleware
    • Orchestrator scaffolding (lib/orchestrator.ts, lib/test-api.ts, lib/sync.ts)
  • E1 — Single-user refresh (1 wk)

    • Refresh existing 122 Maestro flows against current main (Sole palette + NativeWind + SOS gateway changes)
    • Add onboarding wizard flow once PR ad665e05 lands
  • E2 — WebSocket integration layer (3 days) — highest ROI step

    • Multi-client socket.io-client integration tests for chat delivery, booking state broadcast, SOS cascade dispatch
    • Runs in same NestJS test harness as BE unit tests; no mobile runners
  • E3 — Scenario 1 (booking) + 5 (credentials) orchestrated (1 wk)

    • Exercises full orchestrator without the most complex features (no GPS streaming)
  • E4 — Scenario 2 (SOS full cycle) orchestrated (1 wk)

    • Most complex: /test/geo-feed endpoint for GPS injection, clock advance through countdown, cascade dispatch winner assertion
  • E5 — Scenarios 3, 4, 6 + CI wiring + Lost Pixel refresh cadence (1 wk)

    • Complete the canonical 6
  • E6 — Post-launch: synthetic canaries via Checkly (Free tier) against prod — deferred until v1 stable

Total: ~4 weeks of focused eng work. Monthly cost €0 (Maestro CLI + Playwright + Lost Pixel all free tiers). Optional €64/mo Checkly after launch.


Section 5 — Open questions for brainstorming

Before locking the architecture, these need user input:

  1. Test tenancy vs ephemeral DB? The recommended pattern uses test_tenant column + middleware on a shared DB. Alternative: spin up a fresh Postgres via docker compose per CI job (heavier startup, cleaner isolation). Preference?

  2. GPS injection — backend API or device-level? Recommended: backend /test/geo-feed that bypasses mobile GPS entirely, pro app receives lat/long via dispatch WS event. Alternative: Android mock-location + Xcode simulator GPS. Backend approach is faster/simpler but tests less of the real GPS→app path. Acceptable tradeoff?

  3. Clock mocking scope? ClockService.now() abstraction touches Booking reminders, SOS countdown, reservation expiry. Retrofitting the whole codebase is ~1 day’s work. Alternative: scenario-specific fake clock via request header. The latter is hackier but isolates blast radius. Which?

  4. Cofounder-authored tests? TestRigor’s plain-English DSL would let non-technical cofounders write tests. Worth the €300+/mo + lock-in cost? Or keep tests eng-only?

  5. Parallel scenario execution — 2 or 4 emulator pairs? ARM64 Mac mini runner handles 2 pairs comfortably. 4 pairs would need a second runner (~€50/mo Hetzner). Worth it for wall-time reduction, or accept 15-min gate?

  6. Visual regression on multi-role scenarios? Lost Pixel integrated at assertVisible points inside flows, OR a dedicated non-multi-role visual pass? The latter is cheaper and avoids flake amplification but misses visual regressions that only appear during cross-user flows (e.g. chat bubbles rendering).

  7. Maestro Cloud vs local orchestration? Maestro Cloud offers parallel device farm for $$. Self-hosted saves money but caps parallelism. Start self-hosted, escalate if scenario 6+ times out?

  8. Fail-fast policy? On first scenario failure in a matrix run, abort remaining scenarios to save CI minutes? Or always run all 6 for full diagnostic? Recommend the latter despite cost (6×2min = 12 emulator-minutes is cheap; diagnostic value is high).

  9. Real Stripe/Clerk/Novu in tests? Stripe test-mode + Clerk dev tenant + Novu dev env have rate limits. Hitting them per-PR could throttle CI. Recommend: mock Clerk JWT (fixed signing key in test mode), use Stripe test cards, Novu dev env with retries. Is that acceptable, or should we mock all three externals entirely?

  10. Multi-role post-launch — extend to 3-actor (consumer + pro + admin approving credentials live)? Scenario 5 as currently scoped has admin approval offline. A 3-actor version would validate the full admin-review flow but needs a 3rd emulator or a Playwright browser for the admin web UI. Scope in Phase E or defer?




End of report.