Internal Engineering

ReviewData

Architecture & Engineering Onboarding Guide

A deep technical walkthrough of the platform — the runtime topology, the stack, the design patterns, the async model, the Temporal orchestration, and the queue engine that scales it — plus a study roadmap to get productive fast.

FastAPI · async Python 3.12 | SQLAlchemy 2.0 + Postgres | Temporal | Redis · S3 · CloudFront | Next.js 15 Admin UI | 15 pages
Page 1 · Orientation

System at a Glance

What the system does, and the shape of it in one screen.

ReviewData (RD) is backend integrator infrastructure. It has no customers of its own — other platforms (called integrators, chiefly Reputation Management / RM) call its API to get four jobs done, and RD reports back over webhooks.

| Operation | Kind | How RD handles it |
| --- | --- | --- |
| Scrape reviews | Long-running | Temporal workflow → reviews written to S3 → webhook. |
| Find publisher URLs | Long-running | Temporal workflow → URL written to DB → webhook. |
| Generate AI responses | In-process | A single OpenAI call in the API service; no Temporal. |
| Post responses | Long-running | Temporal workflow (or a human via QA) → webhook. |

The whole system on one diagram

 Integrator (RM)           Internal staff             QA staff
        │ API key                │ JWT login             │ Chrome extension
        ▼                        ▼                       ▼
┌──────────────────────────────────────────────────────────────────────────────┐
│                        FastAPI service (async, ASGI)                         │
│     /api/v1 integrator endpoints · /admin · /qa · /internal (extension)      │
└───────┬───────────────┬───────────────┬───────────────┬──────────────────────┘
        │ read/write    │ start wf      │ cache/locks   │ AI (OpenAI)
        ▼               ▼               ▼               ▼
   ┌─────────┐   ┌──────────────┐   ┌────────┐   (in-process call)
   │ Postgres│   │   Temporal   │   │ Redis  │
   │ (schema)│   │   (durable   │   └────────┘
   └────▲────┘   │  workflows)  │
        │        └──────┬───────┘
        │ same DB       │ polls task queues
        │        ┌──────┴───────────────────────────────┐
        │        │ Temporal Workers (Workflow Team)     │
        └────────┤ scrapers · posters · url-finders     │──► S3 (.jl reviews) ─► CloudFront ─► URLs
                 │ + Platform's webhook-delivery worker │──► HTTP POST ────────► Integrator webhook
                 └──────────────────────────────────────┘

The one-sentence mental model

The FastAPI service accepts work and hands off a durable ticket (a job row + a Temporal workflow); Workers do the heavy lifting and write results back to the same database and to S3; webhooks tell the integrator it's done. Everything is correlated by request_id and task_id.

Explicitly out of scope

No customer SPA · No billing (that's RM) · No review content in the DB (reviews live in S3; the DB stores URLs + metadata) · No scraper/poster implementation here (that's the Workflow Team — Page 4).

Page 2 · Orientation

The Technology Stack — a Roadmap

Read this top-to-bottom as a layered map. Each layer sits on the one below it; learn them in this order.

① Edge / HTTP
FastAPI ≥0.115 on Uvicorn (dev) / Gunicorn + Uvicorn workers (prod). Async ASGI framework; auto-generates OpenAPI at /docs. This is where every request enters.
② Contracts
Pydantic v2 — validates every request and response. The typed boundary. SQLAlchemy models are never returned directly; everything crossing HTTP is a Pydantic schema.
③ Business logic
Service layer (plain async Python classes). All rules, orchestration, and transactions. Calls out to the DB, Redis, S3, HTTP, and Temporal via injected clients.
④ Data access
SQLAlchemy 2.0 (async) + asyncpg on PostgreSQL. Migrations via Alembic. Mapped/mapped_column/AsyncSession; postgresql+asyncpg://. Sync psycopg2 is forbidden.
⑤ Orchestration
Temporal (Python SDK) — durable, retryable workflows for scrape / post / URL-find / webhook delivery. State survives crashes and redeploys. Workers poll task queues.
⑥ Infra services
Redis 7+ (cache, locks, rate-limit counters, cookie/session store) · AWS S3 (review .jl files, log archive) · CloudFront (CDN in front of S3).
⑦ Cross-cutting
loguru structured JSON logs + contextvars correlation IDs · Sentry errors · Prometheus metrics · Argon2id/JWT/HMAC/Fernet security.

Auth & security, at a glance

| Concern | Choice | Where used |
| --- | --- | --- |
| Admin/QA login | JWT (access + refresh) | Admin/QA UI; also an extension-type JWT for the Chrome extension. |
| Integrator auth | API key (SHA-256 hashed at rest) | In the request payload; looked up by prefix, then hash-verified. |
| Password hashing | Argon2id (argon2-cffi) | User passwords. |
| Webhook integrity | HMAC-SHA256 (X-RD-Signature) | Signs every outbound webhook body. |
| Secret-at-rest | Fernet symmetric encryption | Publisher credentials, cookies, OAuth tokens. |
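The "looked up by prefix, then hash-verified" flow for integrator keys can be sketched with the standard library alone. This is a minimal illustration, not RD's actual code: the `rd_` key format, `PREFIX_LEN`, the function names, and the dict standing in for the api_keys table are all assumptions.

```python
import hashlib
import hmac
import secrets

PREFIX_LEN = 8  # assumption: number of leading characters stored in clear for lookup


def generate_api_key() -> str:
    """Return a new random key (hypothetical "rd_" format), shown to the integrator once."""
    return "rd_" + secrets.token_urlsafe(32)


def hash_api_key(raw_key: str) -> str:
    """SHA-256 hex digest of the raw key; only this digest is stored at rest."""
    return hashlib.sha256(raw_key.encode("utf-8")).hexdigest()


def verify_api_key(raw_key: str, stored: dict[str, str]) -> bool:
    """Look the key up by prefix, then compare digests in constant time.

    `stored` maps key prefix -> SHA-256 hex digest and stands in for the table.
    """
    expected = stored.get(raw_key[:PREFIX_LEN])
    if expected is None:
        return False
    return hmac.compare_digest(hash_api_key(raw_key), expected)
```

The prefix keeps the lookup indexable without storing the key itself, and `hmac.compare_digest` avoids leaking how many leading characters of the digest matched.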

Tooling & the frontend

Testing

pytest + pytest-asyncio (auto), httpx.AsyncClient over in-process ASGI, factory_boy + Faker, coverage gates in CI.

Quality

ruff (lint + format), mypy (strict on app/), pre-commit hooks. pip + pinned requirements.txt.

Ship

Docker + docker-compose (dev), multi-stage builds for prod, Bitbucket Pipelines CI/CD, AWS ECR + ECS in prod.

Admin/QA UI

Next.js 15 + React 18 + TS (App Router), MUI v6, Zustand (UI state), TanStack Query v5 (server state), RHF + Zod, Recharts.

Banned by convention (don't reach for these)

requests → use httpx · psycopg2 in async paths → use asyncpg · Celery → replaced by Temporal · print() → use the logger · returning ORM models from handlers → always go through Pydantic · time.sleep → use await asyncio.sleep.

Page 3 · Orientation

Runtime Architecture & Processes

The stack (Page 2) is what the code is made of. This is which processes run and how a request travels across all of them.

The independent processes

| Process | Owner | Responsibility |
| --- | --- | --- |
| API service (Gunicorn/Uvicorn) | Platform | All HTTP: integrator REST, Admin/QA, extension endpoints. Starts workflows. Runs the in-process AI Response call and the background queue-manager loops. |
| Temporal server | Self-hosted | Durable state store + task-queue broker. Nothing business-specific lives here. |
| Scraper / poster / URL-finder workers | Workflow Team | Poll task queues, execute the actual publisher logic (Playwright, HTTP, captcha), write results. |
| webhook-delivery worker | Platform | The one worker Platform owns: runs the outbound webhook delivery workflow + retry schedule. |
| PostgreSQL | Shared | Single source of truth. Both teams read/write; only Platform writes migrations. |
| Redis | Shared | Ephemeral state: rate-limit counters, distributed locks, cookie/session store, scraper session tokens. |

Key insight: the API service and the workers are different processes

The API service never runs a scrape or a post. It writes a job row, commits, and starts a Temporal workflow. A separate worker process picks that up. They only meet through Postgres (the job row) and Temporal (the workflow). This decoupling is why a slow scrape can't block an HTTP request, and why deploying the API doesn't interrupt in-flight scrapes.

Lifespan singletons — one client per process

Expensive clients are created once at startup in FastAPI's lifespan context manager and stored on app.state; dependencies pull from there. Creating a new DB engine / HTTP client / Redis pool per request would waste connections and defeat pooling.

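from contextlib import asynccontextmanager

import httpx
from fastapi import FastAPI
from redis.asyncio import Redis
from sqlalchemy.ext.asyncio import create_async_engine
from temporalio.client import Client

@asynccontextmanager
async def lifespan(app: FastAPI):
    app.state.db_engine = create_async_engine("postgresql+asyncpg://...")
    app.state.redis     = Redis.from_url(...)          # one shared pool
    app.state.http      = httpx.AsyncClient(timeout=10.0)  # reuse TCP conns
    app.state.temporal  = await Client.connect(...)
    yield                                              # the app serves requests here
    await app.state.http.aclose()                      # shutdown: release clients
    await app.state.redis.aclose()
    await app.state.db_engine.dispose()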

How one request travels the whole system

  1. HTTP in
  2. Controller (parse + auth)
  3. Service (validate, write job row, commit)
  4. Start Temporal workflow
  5. Return task_id (202)
  6. Worker polls queue
  7. Runs activities (scrape/post)
  8. Writes result to DB + S3
  9. Fires webhook (via delivery worker)
  10. Integrator notified

Steps 1–5 happen in milliseconds inside the API service. Steps 6–10 happen later, durably, in the worker fleet. The integrator bridges the gap by either polling a retrieve-* endpoint or waiting for the webhook.

Page 4 · Orientation

Two Teams, One Database

The single most important organizational fact. Internalize the boundary before you write code.

Platform Team — this repo

The contract surface: REST API, RBAC, Admin/QA UI, webhook delivery, the DB schema + all migrations, every Pydantic model, the extension's auth, and the AI Response service.

Workflow Team — separate repo

The deep publisher logic: scrapers, posters, URL-finders — browser automation (Playwright), captcha solving, proxy rotation. Runs as Temporal workflows + activities.

The three shared contracts

  1. Database schema. Source of truth = this repo's Alembic migrations. The Workflow Team imports the same models but never writes migrations; schema changes route through Platform review.
  2. Webhook events. Event names, payload shapes, and the FailureReason enum are Platform-owned. To fire a webhook, the Workflow Team calls a shared emit_webhook activity — never their own HTTP code.
  3. Workflow IDs. task_id is generated by Platform at request time; the workflow is started with ID "{type}-{task_id}", so the Admin UI can deep-link into Temporal Web and the Workflow Team can look up job context by ID.

Who writes each row

| Step | Owner |
| --- | --- |
| Validate request, create job row, set initial status, start workflow | Platform |
| Update status during execution; write final result (S3 URLs, errors) | Workflow Team |
| Fire the completion webhook (via the shared emit service) | Workflow Team |

Boundary anti-patterns (these break the contract)

Workflow Team creating migrations · Platform writing to scrape_result_batches.s3_url · Workflow Team calling FastAPI endpoints internally (they have direct DB access; endpoints are external-only) · either team rolling its own webhook HTTP · reading/writing the other team's tables without an explicit shared contract.

As an intern on Platform

You will not write scraper/poster code. Your surface is: endpoints, services, schemas, migrations, the QA board, the queue manager, and the UI. When a task touches worker behavior, you change the contract (schema/enum/status), and the Workflow Team changes their code against it.

Page 5 · How the Code Is Built

The MCS Layering

Model → Controller → Service. Strict, and non-negotiable. It is the rule you will be reviewed against most often.

HTTP request
     │
     ▼
[ Controller ]    FastAPI route. Parse · authenticate · call ONE service · return. ≤ 20 lines
     │            No business logic. No DB queries. No transactions. No HTTPException.
     ▼
[ Service ]       ALL business logic + orchestration + transactions (commit here).
     │            Raises domain exceptions (AppError subclasses), never HTTPException.
     ▼
[ (Repository) ]  DB queries: DEFERRED today; services call select()/session.get() directly.
     │
     ▼
[ Model ]         SQLAlchemy tables + Pydantic schemas. Data shapes only. Zero logic.

Why the layering earns its keep

Testability

A service is a plain class wrapping a session — unit-test it without booting FastAPI or hitting HTTP.

Maintainability

Thin controllers physically can't accumulate hidden logic; the ≤20-line rule forces it into services.

Consistency

Every endpoint looks identical, so any file is readable on first open.

Clean errors

Services speak domain language (NotFoundError); one global handler turns that into HTTP + the envelope.

The canonical shape

Controller — wiring only:

@router.post("/admin/users/{user_id}/disable", response_model=UserOut,
             dependencies=[require_permission("user.manage")])
async def disable_user(user_id: uuid.UUID,
                       session: AsyncSession = Depends(get_session),
                       auth: AuthContext = Depends(get_auth_context)) -> UserOut:
    return await UserService(session).disable_user(user_id, auth.user_id)

Service — logic + the transaction:

class UserService(BaseService):
    async def disable_user(self, user_id, actor_id) -> User:
        user = await self.db.get(User, user_id)
        if user is None:
            raise NotFoundError("User not found")   # domain exception
        user.is_active = False
        user.disabled_at = now_dt()
        user.disabled_by = actor_id
        await self.commit()                          # transaction lives HERE
        await self.db.refresh(user)
        return user

Domain organization (not type organization)

Controllers, services, and schemas each split into the same seven domains: core, admin, qa, integrator, posting, scraping, internal. Only models stay flat (app/models/*.py, one file per table). External I/O (HTTP, S3, Redis, Temporal) goes through dedicated client classes injected at the service layer.

The five violations fixed in every review

① transactions in controllers · ② business logic / DB queries in controllers · ③ HTTPException in services · ④ imports inside functions · ⑤ inline BaseModel in controllers. Also modern typing only (X | None, list[X]), and from __future__ import annotations first in every file.

Page 6 · How the Code Is Built

Design Patterns in RD

The recurring shapes you'll see everywhere. Recognize them and new code writes itself.

1 · Service object over a session

Every service extends BaseService and is constructed inline as UserService(session) — effectively zero-cost (one attribute set), stateless. Per-service Depends() factories were deliberately rejected for now: they double indirection without earning testability. The service owns the transaction; the controller owns nothing but the HTTP call.

2 · Exception hierarchy → single envelope

All errors subclass AppError (which carries an HTTP code, a stable error_code, and a message). Services raise them; a global handler renders the one true envelope. Clients branch on the machine-readable error_code, never on prose.

class AppError(Exception): code = 500; error_code = "INTERNAL_ERROR"
class NotFoundError(AppError):  code = 404; error_code = "NOT_FOUND"
class ConflictError(AppError):  code = 409; error_code = "CONFLICT"
# domain-specific extends the base:
class DuplicateLocationError(ConflictError): error_code = "DUPLICATE_LOCATION"

3 · Pydantic contract at the boundary

Request schema in, response schema out — and they are separate types. A request schema validates untrusted input; a response schema (with ConfigDict(from_attributes=True)) hydrates from an ORM row. Integrator responses inherit a StandardEnvelope (code + status) as their first two fields; pagination uses a generic Paginated[T] wrapper.

4 · Idempotency at every async boundary

Because Temporal retries and integrators retry, every retryable operation must be safe to run twice. The patterns:

| Boundary | Idempotency key |
| --- | --- |
| Starting a workflow | Deterministic ID scrape-{task_id} → a duplicate start raises WorkflowAlreadyStartedError, caught & treated as success. |
| DB writes in an activity | INSERT ... ON CONFLICT DO NOTHING/UPDATE. |
| S3 writes | Deterministic key {request_id}/{task_id}-batch-{n}.jl; overwrite is safe (same content). |
| Outbound webhook | delivery_id; the integrator dedupes on its side. |
| Integrator retry of a POST | De-dup on (api_key, foreign_key, endpoint) for a short window. |
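
The first and third rows can be illustrated with a small in-memory sketch. `InMemoryOrchestrator` and `WorkflowAlreadyStarted` are hypothetical stand-ins written for this example, not the Temporal SDK classes; the point is only the shape of "deterministic ID, duplicate start treated as success".

```python
import asyncio


class WorkflowAlreadyStarted(Exception):
    """Stand-in for the SDK's duplicate-start error (not the real class)."""


class InMemoryOrchestrator:
    """Hypothetical stand-in for a workflow client that rejects duplicate IDs."""

    def __init__(self) -> None:
        self.started: dict[str, dict] = {}

    async def start_workflow(self, name: str, payload: dict, *, id: str) -> str:
        if id in self.started:
            raise WorkflowAlreadyStarted(id)
        self.started[id] = {"workflow": name, "payload": payload}
        return id


def workflow_id(kind: str, task_id: str) -> str:
    """Deterministic workflow ID in the "{type}-{task_id}" convention."""
    return f"{kind}-{task_id}"


def s3_batch_key(request_id: str, task_id: str, n: int) -> str:
    """Deterministic object key, so a retried upload overwrites the same object."""
    return f"{request_id}/{task_id}-batch-{n}.jl"


async def start_scrape_once(client: InMemoryOrchestrator, task_id: str) -> str:
    wf_id = workflow_id("scrape", task_id)
    try:
        await client.start_workflow("ScrapeWorkflow", {"task_id": task_id}, id=wf_id)
    except WorkflowAlreadyStarted:
        pass  # a retry of the same request: the workflow already exists, so succeed
    return wf_id


async def demo() -> tuple[str, str, int]:
    client = InMemoryOrchestrator()
    first = await start_scrape_once(client, "t-123")
    second = await start_scrape_once(client, "t-123")  # simulated retry
    return first, second, len(client.started)
```

Because the ID is derived from task_id rather than generated at start time, the retry cannot create a second workflow.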

5 · Correlation via contextvars

A middleware stamps request_id (and task_id/user_id when in scope) into a contextvar; the logger reads it automatically, so every log line across the async call stack is correlated without threading an argument everywhere.
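
A minimal sketch of the mechanism, using the standard library's logging module in place of loguru so it stays self-contained. The logger name, the filter class, and the handler that collects lines in a list are illustrative assumptions; in RD the middleware, not the handler function, would set the variable.

```python
import asyncio
import contextvars
import logging

request_id_var: contextvars.ContextVar[str] = contextvars.ContextVar(
    "request_id", default="-"
)


class CorrelationFilter(logging.Filter):
    """Copy the current request_id onto every record before it is formatted."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.request_id = request_id_var.get()
        return True


class ListHandler(logging.Handler):
    """Collects formatted lines in memory so the example can inspect them."""

    def __init__(self) -> None:
        super().__init__()
        self.lines: list[str] = []

    def emit(self, record: logging.LogRecord) -> None:
        self.lines.append(self.format(record))


def build_logger() -> tuple[logging.Logger, ListHandler]:
    handler = ListHandler()
    handler.setFormatter(logging.Formatter("%(request_id)s %(message)s"))
    handler.addFilter(CorrelationFilter())
    logger = logging.getLogger("rd.sketch")  # hypothetical logger name
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger, handler


async def deep_helper(logger: logging.Logger, step: str) -> None:
    await asyncio.sleep(0)  # yield so the two requests interleave
    logger.info("step %s", step)  # no request_id argument threaded through


async def handle_request(logger: logging.Logger, request_id: str) -> None:
    request_id_var.set(request_id)  # what the middleware would do
    await deep_helper(logger, "a")
    await deep_helper(logger, "b")


async def demo() -> list[str]:
    logger, handler = build_logger()
    await asyncio.gather(
        handle_request(logger, "req-1"), handle_request(logger, "req-2")
    )
    return sorted(handler.lines)
```

Each coroutine passed to `asyncio.gather` runs as its own task with a copy of the context, which is why the two interleaved requests never see each other's request_id.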

6 · Lifespan singletons + explicit DI

Clients live on app.state (Page 3); the session is injected per-request via Depends(get_session): one session per request, committed at the service boundary, never in the controller. Permission gates are declarative dependencies: dependencies=[require_permission("...")].

Soft-delete, always

No hard deletes anywhere: deactivate with is_active=False or stamp deleted_at. History is a feature, not clutter.

Page 7 · How the Code Is Built

The Async & Concurrency Model

The whole stack runs on one event loop per process. A single blocking call stalls every concurrent request — treat sync I/O like a memory leak.

The golden rule

If a function does I/O (DB, HTTP, file, Redis, S3) it is async def. If it calls an async def, it must await it. No exceptions on hot paths.
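
The stall is easy to observe. In this illustrative sketch a background ticker stands in for "every other concurrent request", and we count how often it gets to run while the foreground either blocks with `time.sleep` or yields with `asyncio.sleep`. The function names and the 10 ms / 200 ms timings are arbitrary choices for the demonstration.

```python
import asyncio
import time


async def ticker(counter: list[int], interval: float = 0.01) -> None:
    """Stand-in for other requests: increments whenever the loop lets it run."""
    while True:
        await asyncio.sleep(interval)
        counter[0] += 1


async def ticks_during(blocking: bool, pause: float = 0.2) -> int:
    counter = [0]
    task = asyncio.create_task(ticker(counter))
    await asyncio.sleep(0)  # give the ticker a chance to start
    if blocking:
        time.sleep(pause)  # blocks the whole event loop
    else:
        await asyncio.sleep(pause)  # yields; the ticker keeps running
    ticks = counter[0]
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
    return ticks
```

With `blocking=True` the ticker makes no progress at all during the pause; with `blocking=False` it ticks many times in the same wall-clock interval.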

The tools by resource

| Resource | Use | Never |
| --- | --- | --- |
| Database | AsyncSession, session.scalar()/scalars(), postgresql+asyncpg:// | session.query() (1.x style), psycopg2 |
| HTTP | httpx.AsyncClient, reused from app.state | requests; a new client per call |
| Redis | redis.asyncio via RedisService, one shared pool | the sync redis API; a client built ad-hoc |
| S3 | aioboto3 (or boto3 inside asyncio.to_thread) | blocking boto3 on the loop |
| Sleep / locks | await asyncio.sleep(), asyncio.Lock | time.sleep(), threading.Lock |

Concurrency patterns you'll use

# Fan out in parallel
results = await asyncio.gather(fetch_a(), fetch_b(), fetch_c())

# Bounded concurrency (don't stampede a dependency)
sem = asyncio.Semaphore(10)
async def bounded(x):
    async with sem:
        return await process(x)
results = await asyncio.gather(*(bounded(x) for x in items))

# Deadline
async with asyncio.timeout(5):
    result = await slow_call()

# Unavoidable CPU-bound / sync code → offload it to a thread
result = await asyncio.to_thread(blocking_parse, blob)

Choosing where background work runs

| Work | Mechanism |
| --- | --- |
| Short, non-critical (cache warm, fire-and-forget log) | asyncio.create_task(...) in the handler |
| Durable, retryable, multi-step (scrape, post, webhook) | Temporal workflow |
| Recurring poll loops (the queue managers) | Long-lived task started in lifespan |
| Anything you'd have used Celery for | Temporal. Celery is banned. |

Classic bugs that break async

forgetting await (you now hold a coroutine, not the value) · a sync DB driver stalling the loop for all requests · a fresh httpx.AsyncClient per call (kills connection reuse) · asyncio.run() inside app code (the ASGI server owns the loop) · nest_asyncio (forbidden).

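
One caveat for the first row: the event loop keeps only a weak reference to tasks, so a fire-and-forget task with no other reference can be garbage-collected before it finishes. A common remedy, sketched here with hypothetical names (`spawn`, `warm_cache`, and the module-level set are not RD helpers), is to hold the task in a set until it completes.

```python
import asyncio

# Strong references to in-flight background tasks.
_background: set[asyncio.Task] = set()


def spawn(coro) -> asyncio.Task:
    """Start a fire-and-forget task and keep it referenced until it is done."""
    task = asyncio.create_task(coro)
    _background.add(task)
    task.add_done_callback(_background.discard)  # drop the reference on completion
    return task


async def warm_cache(store: dict[str, str], key: str) -> None:
    """Illustrative short, non-critical job."""
    await asyncio.sleep(0)
    store[key] = "warm"


async def demo() -> tuple[dict[str, str], int, int]:
    store: dict[str, str] = {}
    spawn(warm_cache(store, "publisher:example"))
    pending_before = len(_background)
    await asyncio.sleep(0.01)  # the handler carries on; the task finishes meanwhile
    return store, pending_before, len(_background)
```

Anything that must survive a process restart still belongs in Temporal; this pattern only protects short in-process work.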
Page 8 · How the Code Is Built

The Data Model

The database is the shared backbone between both teams. These are the entities and identifiers you'll meet daily.

The location hierarchy

Integrator                      has an API key, and a scrape_priority tier
└── Location                    business at one address · external_location_id = RM's business.id
    └── Site                    Location × Publisher · holds publisher_url + url_status
        └── PublisherAccount    auth state for posting: credentials / cookies / OAuth
            └── AccountSession  live session token + consecutive-failure counters

Configuration & identity

PublisherConfiguration

One row per publisher_key — the routing brain. Decides scraper engine (internal SAU vs external LDE), posting mode (AUTOMATED / MANUAL_QA / DISABLED), and auth type (OAUTH / CREDENTIALS / COOKIES).

IdentityAddress

A pooled email alias for the Business Manager (BM) flow — a publisher emails a "manager invite" QA must accept. Aliases rotate out of the pool after a confirmation threshold.

Jobs & cross-cutting tables

Job tables (Platform creates the row; Workflow Team writes results): scrape_jobs, posting_jobs, url_finder, ai_response_job, competitive_report_job, insight_report_job. Cross-cutting: inbound_request (every payload archived, keyed by request_id) · webhook endpoints/subscriptions/deliveries · users/roles/permissions (RBAC) · api_keys · qa_task/qa_task_event/qa_stream_config · jwt_token.

The identifiers (know the difference cold)

| Identifier | Generated by | Purpose |
| --- | --- | --- |
| request_id | Platform (inbound) | Correlates one request → all downstream work + every log line. Never null. |
| task_id | Platform (job create) | The job ID; embedded in the Temporal workflow ID as {type}-{task_id}. |
| delivery_id | Platform (per attempt) | Changes on each webhook retry; the integrator's idempotency key. |
| foreign_key | Integrator | RM's own reference; RD only echoes it back. |
| external_location_id | Integrator | RM's business.id. |
| internal_business_id | Discovered while scraping | Publisher-side business ID (e.g. place_id); required to post. |
| internal_review_id | Discovered while scraping | Publisher-side review ID; sent to the publisher when posting a reply. |
Page 9 · Orchestration & Scale

Temporal Orchestration

Temporal is how RD runs work that must survive crashes, retries, and 24-hour human waits. This is the engine room.

Workflows vs activities

Workflow

Deterministic orchestration code. Its state is durably replayed after any crash. No direct I/O, no randomness, no wall-clock — use workflow.now(), workflow.uuid4(), workflow.random().

Activity

The side-effecting unit — HTTP calls, DB writes, S3 uploads, Playwright. Retried automatically by Temporal, so it must be idempotent.

Conventions the Platform team enforces

Human-in-the-loop: signals + wait conditions

When a workflow needs a human (a URL entered, cookies captured), it waits — durably, for up to 24h — on a signal. This is how the QA board plugs into a running workflow.

@workflow.signal
async def provide_url(self, payload: UrlProvidedPayload):
    self._provided_url = payload.url

# elsewhere in the workflow body — always with a timeout:
await workflow.wait_condition(
    lambda: self._provided_url is not None,
    timeout=timedelta(hours=24),
)

How Platform triggers a workflow (from a service)

job = await self.repo.create(...)
await self.session.commit()                       # commit FIRST
await self.temporal.start_workflow(
    "ScrapeWorkflow", ScrapeWorkflowInput(scrape_job_id=job.id, ...),
    id=f"scrape-{job.id}",
    task_queue=route_to_task_queue(payload.publisher_key),   # Page 10
)

Advanced patterns you'll encounter in the wild

RD's scrape flow has evolved to a split flow for some publishers (a short browser "lookup" workflow re-buffers the job for a cheap HTTP scrape stage), session refresh activities that re-mint blocked cookies, and re-buffer-with-backoff instead of long in-workflow retries, all so scarce browser slots aren't held during cheap HTTP work. You don't need this on day one, but it's why the queue engine (Page 11) exists.

Page 10 · Orchestration & Scale

Task Queues & Worker Topology

Temporal routes work by task queue. Different work has wildly different cost profiles, so it gets different queues and different concurrency limits.

The queues

| Queue | Purpose | Concurrency profile |
| --- | --- | --- |
| scraping-internal | Internal SAU scraper workflows | normal |
| scraping-lde | LDE-routed scrapes | normal |
| scrape-<publisher> | Per-publisher scrape workflows + their HTTP activities | normal (per queue) |
| scraping-browser | Activity-routing target: only the browser-backed activities hop here | bounded by the browser-api session envelope (~12) |
| posting | Response posting (most common) | normal |
| posting-browser | Cookie/credential posting (Playwright) | 1 per worker (Playwright is heavy) |
| url-finder | URL discovery | normal |
| webhook-delivery | Outbound webhook delivery + retry | high (horizontally scaled) |
| maintenance | Cron-style: cookie health, alias rotation, archival | low |

Two concepts that trip people up

Activity-level routing (the "browser tier")

A scrape workflow starts on scrape-<publisher> and its cheap HTTP activities stay there, but its expensive browser-backed activities hop onto the shared scraping-browser pool. Why: one browser-api container handles only ~15–20 concurrent Chrome sessions, and you don't want that scarce resource throttling cheap HTTP. The browser pool is a global cap (~10–12); the HTTP work scales wide independently.

"Queue" is an overloaded word: mind which one

① a Temporal task queue (this page: how workers pick up work); ② a webhook queue variant (a sub-classifier on an event, e.g. DATA_RESULT:REVIEW_DATA vs :LDE_DATA); ③ the scrape "queue" backlog of BUFFERED job rows in Postgres that the queue manager dispatches (Page 11). They are three different things.

Worker fleet shape (prod)

Roughly 30 task-queue workers run inside worker processes on ECS (EC2, c6i.2xlarge, bin-packed ~3 tasks/machine). The fleet scales 1 → 12 tasks across up to 4 machines. Crucially, it does not scale on CPU/memory (HTTP-light scraping sits at ~13% CPU); it scales on queue backlog, which is the whole subject of the next page.

Page 11 · Orchestration & Scale

The Queue Engine: Dispatch, Fairness & Autoscaling

RD's most sophisticated Platform-owned subsystem. It decides how many workflows run, which jobs get a slot, and how big the worker fleet grows. Learn it — it's the interview-grade part of the codebase.

The problem it solves

Scrape submits are cheap and instant, but scrapes are slow and resource-bounded. If every submit started a live Temporal workflow immediately, a burst of thousands would flood one event loop, starve the pollers, and freeze the fleet (this actually happened in prod). So jobs land in Postgres as BUFFERED rows, and a single background queue manager meters them into Temporal.

The control loop

BUFFERED job rows ──► queue_manager publishes OutstandingScrapeJobs ──► CloudWatch
                                                                           │
                                                                           ▼
                                            ECS step-scaling policy on the worker service
                                                                           │
                                                     desired worker tasks: 1 ◄─► 12
                                                                           │
  queue_manager reads live worker count ◄── Temporal WorkerHeartbeat (list_workers)
        │
        ▼
  dynamic gate = (50 × running_workers) + 10 browser ──► dispatch that many into Temporal
                                                                           │
                                                                           ▼
  each worker sees a safe ~50 HTTP + shared 10 browser → drains → buffer shrinks → scale in

The five moving parts

  1. Dynamic gate. The cap on concurrently-running workflows is recomputed each cycle as PER_TASK_HTTP_CONCURRENCY (50) × running_worker_count, plus a global browser cap of 10. So the HTTP portion of the gate is 50 with 1 worker and 600 with 12 workers, and each task always sees a safe per-task load regardless of fleet size. Worker count comes from Temporal's WorkerHeartbeat registry, which is presence-based, so a saturated worker is still counted; an activity-poller count under-reported the busy fleet by ~45%.
  2. Backlog metric. ECS scales on OutstandingScrapeJobs = BUFFERED + PENDING_SCRAPE + QUEUED + IN_PROGRESS, that is, all non-terminal work rather than just waiting depth. Scaling in while heavy workflows were still running once stranded ~150 of them.
  3. Fair dispatch. Chairs are handed out round-robin across publishers that have buffered work — one ROW_NUMBER() OVER (PARTITION BY publisher_key) query — so a 1,000-job flood of one publisher can't head-of-line-block a single job of another.
  4. Holding ceiling. Under contention (≥2 active publishers), no single publisher may occupy more than ~60% of the gate — reserving headroom for a newcomer. A lone publisher gets 100% (no capacity wasted).
  5. Integrator priority tiers. Jobs carry a denormalized scrape_priority (lower = higher; production = 100, test/staging raised to 1000). Dispatch is priority-first: the whole production tier drains before any test-tier job gets a slot — but test load still counts toward autoscaling, so it's a first-class citizen for capacity, it just yields dispatch order.
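
The arithmetic and ordering rules above can be sketched in plain Python. This is an in-memory approximation under stated assumptions, not the queue manager's code: the real fair dispatch is a single SQL query using ROW_NUMBER() OVER (PARTITION BY publisher_key), and the `Job` dataclass, the function names, and the integer rounding of the ~60% ceiling are all illustrative.

```python
from collections import defaultdict, deque
from dataclasses import dataclass

PER_TASK_HTTP_CONCURRENCY = 50  # from the text
BROWSER_CAP = 10                # from the text
HOLDING_CEILING_PERCENT = 60    # "~60%" in the text; exact rounding is an assumption


@dataclass(frozen=True)
class Job:
    task_id: str
    publisher_key: str
    scrape_priority: int = 100  # lower = higher priority


def dynamic_gate(running_workers: int) -> int:
    """(50 x running workers) HTTP slots plus the fixed browser cap."""
    return PER_TASK_HTTP_CONCURRENCY * running_workers + BROWSER_CAP


def outstanding_scrape_jobs(counts: dict[str, int]) -> int:
    """Backlog metric: every non-terminal state, not just waiting depth."""
    states = ("BUFFERED", "PENDING_SCRAPE", "QUEUED", "IN_PROGRESS")
    return sum(counts.get(state, 0) for state in states)


def holding_ceiling(gate: int, active_publishers: int) -> int:
    """A lone publisher may use the whole gate; under contention, about 60%."""
    if active_publishers < 2:
        return gate
    return gate * HOLDING_CEILING_PERCENT // 100


def fair_dispatch(buffered: list[Job], free_slots: int) -> list[Job]:
    """Priority tiers first; round-robin across publishers within each tier."""
    tiers: dict[int, dict[str, deque[Job]]] = defaultdict(lambda: defaultdict(deque))
    for job in buffered:
        tiers[job.scrape_priority][job.publisher_key].append(job)

    picked: list[Job] = []
    for priority in sorted(tiers):  # lower number drains first
        queues = tiers[priority]
        while queues and len(picked) < free_slots:
            for publisher in list(queues):  # one job per publisher per round
                if len(picked) >= free_slots:
                    break
                picked.append(queues[publisher].popleft())
                if not queues[publisher]:
                    del queues[publisher]
    return picked
```

For example, with 1,000 buffered jobs for one publisher and a single job for another, `fair_dispatch(jobs, 4)` returns the second publisher's job in position two instead of position 1,001.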

The same idea, applied to LLM reports

Insight (C32) and Competitive (C34) reports reuse the exact pattern: a BUFFERED admission state + a ReportQueueManagerService, gating at the process-start boundary (not submit) — so even the "600 scrapes finish at once → 600 resume hooks fire" thundering herd gets metered. Two independent lanes (insight / competitive, ~40 concurrent each, since they run on separate worker fleets), with a shared OpenAI backpressure belt that honors 429/Retry-After above both.

Why this is worth studying

It's a compact, real-world lesson in distributed systems: admission control, fairness, priority, presence-based health signals, and scaling on the right metric. Nearly every non-obvious decision here came from a production incident; the changelog in the spec doc reads like a post-mortem series.

Page 12 · Orchestration & Scale

Request & Webhook Flows

The external contract: how integrators submit work and how RD reports back.

The inbound REST surface

API key in the payload; every response wraps the standard envelope. Async endpoints return 202 Accepted + a task_id.

| Endpoint | Kind | Does |
| --- | --- | --- |
| POST /request-reviews | Async → task_id | Kick off a scrape. |
| POST /retrieve-task | Sync, paginated | Read scrape results. |
| POST /get-publisher-info | Sync | Look up publisher/account info. |
| POST /submit-response | Async → task_id | Post a reply to a review. |
| POST /retrieve-posting-data | Sync, paginated | Read posting results. |
| POST /ai-response | Sync/async (per config) | Generate an AI reply (in-process OpenAI). |
| POST /url-finder/per-business · /file-based | Async → task_id | URL discovery, single or bulk. |

A scrape, end to end

  1. RM calls POST /request-reviews; Platform authenticates the key and archives the payload in inbound_request (request_id).
  2. A scrape_jobs row is created (BUFFERED) with a fresh task_id; the endpoint returns 202 + task_id.
  3. The queue manager (Page 11) meters the job into a Temporal workflow scrape-<task_id> on the right queue.
  4. The Workflow Team's scraper runs — internal (SAU) or external (LDE) per PublisherConfiguration — and writes reviews to S3 as .jl files (DB stores only CloudFront URLs + metadata).
  5. On completion the workflow updates the row and fires EVENT.DATA_RESULT via the shared emit_webhook activity.
  6. RM receives the webhook (or polls /retrieve-task).
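
Step 4 stores reviews as .jl files. Assuming that means JSON Lines (one JSON object per line), a minimal writer and reader might look like the sketch below; the review field names in the usage are invented for illustration and are not RD's review schema.

```python
import io
import json
from collections.abc import Iterable, Iterator


def write_jsonl(reviews: Iterable[dict]) -> bytes:
    """Serialize reviews as JSON Lines: one compact JSON object per line."""
    buf = io.StringIO()
    for review in reviews:
        # sort_keys makes the bytes deterministic, so a retried upload to the
        # same S3 key overwrites identical content.
        buf.write(json.dumps(review, ensure_ascii=False, sort_keys=True))
        buf.write("\n")
    return buf.getvalue().encode("utf-8")


def read_jsonl(blob: bytes) -> Iterator[dict]:
    """Parse JSON Lines back into dicts, skipping blank lines."""
    # Split on "\n" only: newlines inside strings are escaped by json.dumps.
    for line in blob.decode("utf-8").split("\n"):
        if line.strip():
            yield json.loads(line)
```

A consumer can stream such a file line by line instead of loading one large JSON array, which suits batches of reviews.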

Outbound webhook events

| Event | Fires when |
| --- | --- |
| EVENT.URL_UPDATED | A publisher URL is resolved (auto or QA). |
| EVENT.DATA_RESULT | A scrape completed. |
| EVENT.CREDENTIAL_UPDATED · EVENT.COOKIE_UPDATED | QA added credentials / captured cookies. |
| EVENT.RESPONSE_SUBMISSION | A posting attempt finished (success or failure). |
| EVENT.PUBLISHER_DISCONNECTED | An account got disconnected during posting. |

Delivery is itself a durable workflow

Webhook delivery isn't a fire-and-forget HTTP call; it's a Temporal workflow (Platform-owned, on the webhook-delivery queue) that loops an 8-attempt retry schedule (workflow.sleep() between attempts) over the activities http_post_webhook, record_attempt, and mark_dead_letter. After 8 failures it's dead-lettered and surfaced in the Admin UI for manual replay. Each attempt carries a fresh delivery_id; the body is HMAC-signed.
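
The signing step can be sketched with the standard library. The X-RD-Signature header name and HMAC-SHA256 come from this guide; the hex encoding, the payload layout, placing delivery_id in the body, and the function names are assumptions made for the example.

```python
import hashlib
import hmac
import json
import uuid


def sign_body(secret: bytes, body: bytes) -> str:
    """HMAC-SHA256 over the exact bytes sent on the wire (hex is an assumption)."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def build_delivery(secret: bytes, event: str, payload: dict) -> tuple[dict[str, str], bytes]:
    """Build one delivery attempt: fresh delivery_id, serialized body, signature."""
    body = json.dumps(
        {"event": event, "delivery_id": str(uuid.uuid4()), "data": payload},
        separators=(",", ":"),
        sort_keys=True,
    ).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        "X-RD-Signature": sign_body(secret, body),
    }
    return headers, body


def verify_delivery(secret: bytes, headers: dict[str, str], body: bytes) -> bool:
    """Receiver side: recompute the signature and compare in constant time."""
    received = headers.get("X-RD-Signature", "")
    return hmac.compare_digest(sign_body(secret, body), received)
```

The receiver must verify against the raw request bytes, before any JSON parsing or re-serialization, or the signature will not match.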

Page 13 · Operations

Posting, QA & Error Handling

The human-in-the-loop system, and how failures are shaped into a clean contract.

Posting: mode × auth type

Modes

AUTOMATED — the poster workflow logs in and posts (Yelp, Booking). MANUAL_QA — routed to a human via a QA stream (OAuth publishers like Google/Facebook). DISABLED — off.

Auth types

OAUTH — token-based. CREDENTIALS — Playwright login, session captured. COOKIES — QA pre-captures session cookies via the extension; the poster replays them.

The QA Task Board — five streams

  1. URL — QA enters a publisher URL for a site missing one.
  2. Responses — QA manually posts a reply (MANUAL_QA / escalations).
  3. Poster Waiting — waiting on the client to accept a manager invite.
  4. Poster Credentials — QA enters publisher login credentials.
  5. Poster Cookies — QA captures cookies via the Chrome extension.

Tasks have lifecycle events (qa_task_event), are claimed/released by operators, scoped per user via stream configs, governed by qa.task.* permissions. An SLA sweeper flags stale tasks; reconcilers keep the board consistent with the underlying jobs. Completing a QA action signals the waiting Temporal workflow (Page 9) — that's the bridge between the board and the engine.

Error handling: one envelope, always

Services raise typed AppError subclasses; a global handler renders exactly this — raw FastAPI/Pydantic/SQLAlchemy errors never reach a client, and stack traces go to logs, not responses.

{
  "code": 409,
  "status": "failed",
  "error": "A location with this external_location_id already exists.",
  "details": { "external_location_id": ["already exists"] },
  "meta": {
    "request_id": "9f9d2b5a-...",
    "timestamp": "2026-04-28T14:23:11Z",
    "error_code": "DUPLICATE_LOCATION"
  }
}

Observability

Structured JSON logs (loguru) carry request_id/task_id/user_id on every line via contextvars; logs ship to S3 (daily partitions) + Datadog. Sentry catches exceptions. Note that a bare logger.error() does not reach Sentry; you must call capture_exception(...) explicitly. Prometheus exposes metrics.

Page 14 · Study Roadmap

Foundations to Master First

A week-by-week path. Each item says why it matters here and what "done" looks like.

Week 1 — Language & async fundamentals

  1. Python async/await & the event loop (Page 7). Coroutines, gather, Semaphore, timeout, why blocking calls are banned.
    Done when: you can explain why one sync DB call stalls every concurrent request, and write a bounded-concurrency fan-out.
  2. Modern typing. list[X], X | None, Annotated; from __future__ import annotations. RD runs mypy strict.
    Done when: you can read a fully-typed service method and know every shape without running it.
  3. HTTP & REST. Methods, status codes (esp. 202 for async), idempotency, API-key vs bearer auth.
    Done when: you can say why /submit-response returns a task_id instead of the final result.
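As a reference point for the first item, a bounded-concurrency fan-out might look like the sketch below. The fetch function and its sleep are placeholders for any awaitable I/O call; the limit and timeout values are arbitrary.

```python
import asyncio


async def fetch(item: int, sem: asyncio.Semaphore, stats: dict[str, int]) -> int:
    """Do one unit of I/O while holding a semaphore slot."""
    async with sem:
        stats["running"] += 1
        stats["peak"] = max(stats["peak"], stats["running"])
        # Stand-in for an awaitable I/O call. A blocking call such as
        # time.sleep() here would freeze the event loop for every task.
        await asyncio.sleep(0.01)
        stats["running"] -= 1
        return item * 2


async def fan_out(items: list[int], limit: int) -> tuple[list[int], int]:
    """Run fetch() for every item, never more than `limit` at a time."""
    sem = asyncio.Semaphore(limit)
    stats = {"running": 0, "peak": 0}
    results = await asyncio.wait_for(
        asyncio.gather(*(fetch(i, sem, stats) for i in items)),
        timeout=5,
    )
    return list(results), stats["peak"]


results, peak = asyncio.run(fan_out(list(range(20)), limit=5))
```

gather preserves input order in its results, and the semaphore caps how many coroutines are inside the I/O section at once, which is what protects a downstream service or connection pool.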

Week 2 — The frameworks

  1. FastAPI — routers, Depends, response_model, declarative permission gates. Skim the live /docs.
    Done when: you can trace a request URL → controller → service → response and name each layer's job. Ref: backend/03-fastapi-conventions.md.
  2. Pydantic v2. BaseModel, Field, validators, BaseSettings, from_attributes. Understand why request and response schemas are separate types.
    Done when: you can write both and explain why models never cross HTTP. Ref: backend/04-pydantic-schemas.md.
  3. SQLAlchemy 2.0 async + Alembic. Mapped, AsyncSession, select()/scalars(), relationships, migrations.
    Done when: you can read a model file, describe its table + FKs, and explain alembic upgrade head. Ref: backend/05-sqlalchemy-and-alembic.md; skills: alembic-migrations, run-migration.
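The "models never cross HTTP" rule from the Pydantic item can be illustrated without any framework. The sketch below swaps Pydantic and SQLAlchemy for stdlib dataclasses so it runs anywhere; the class names and the internal_notes field are hypothetical.

```python
from dataclasses import asdict, dataclass


@dataclass
class LocationRow:
    """Stand-in for an ORM model: includes a column clients must never see."""

    id: int
    name: str
    external_location_id: str
    internal_notes: str


@dataclass(frozen=True)
class LocationCreate:
    """Request schema: only what a caller may send."""

    name: str
    external_location_id: str

    def __post_init__(self) -> None:
        # Stand-in for a Pydantic validator.
        if not self.name.strip():
            raise ValueError("name must not be blank")


@dataclass(frozen=True)
class LocationRead:
    """Response schema: only what a caller may receive."""

    id: int
    name: str
    external_location_id: str

    @classmethod
    def from_row(cls, row: LocationRow) -> "LocationRead":
        # Plays the role that from_attributes plays in Pydantic v2.
        return cls(
            id=row.id,
            name=row.name,
            external_location_id=row.external_location_id,
        )


def create_location(payload: LocationCreate, next_id: int) -> LocationRead:
    """Service-layer sketch: the row exists only inside this function."""
    row = LocationRow(next_id, payload.name, payload.external_location_id, internal_notes="")
    return LocationRead.from_row(row)


body = asdict(create_location(LocationCreate("Main St", "ext-1"), next_id=7))
```

Because the response type simply has no internal_notes field, adding a column to the model can never leak it to a client by accident.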

Week 3 — What makes RD special

  1. Temporal (Pages 9–10) — workflows vs activities, durability, retries, task queues, signals/wait-conditions, the {type}-{task_id} convention.
    Done when: you can explain what happens if a worker crashes mid-scrape and why the job survives. Ref: backend/06-temporal-conventions.md; skill: temporal-workflows.
  2. The MCS layering + design patterns (Pages 5–6) — re-read 01-architecture/01-mcs-architecture.md until you spot a violation instantly.
    Done when: given a controller with a DB query in it, you name the broken rule and where the code moves.
  3. The queue engine (Page 11) — read 01-architecture/06-worker-queue-driven-autoscaling.md. It's dense but it's the crown jewel.
    Done when: you can explain the dynamic gate, why it scales on backlog not CPU, and what fair dispatch prevents.
  4. Errors, logging, testing (Page 13) — the envelope, the exception hierarchy, correlation IDs, why logger.error misses Sentry.
    Refs: backend/07-error-handling.md, 08-logging.md, 09-testing.md.
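For the queue-engine item, the two ideas to internalize are scaling on backlog and fair dispatch. The sketch below is a toy model of both under stated assumptions: the sizing formula, the round-robin policy, and every name and number are illustrative, and the real engine described in 06-worker-queue-driven-autoscaling.md may work differently.

```python
from __future__ import annotations

import math
from collections import deque


def desired_workers(
    backlog: int, jobs_per_worker: int, min_workers: int, max_workers: int
) -> int:
    """Size the pool from queued work rather than CPU.

    Assumed rationale: I/O-bound workers can show low CPU while the backlog grows.
    """
    wanted = math.ceil(backlog / jobs_per_worker)
    return max(min_workers, min(max_workers, wanted))


def fair_dispatch(queues: dict[str, deque[str]], slots: int) -> list[str]:
    """Release up to `slots` jobs, taking one per integrator per pass (round-robin)."""
    released: list[str] = []
    while len(released) < slots and any(queues.values()):
        for queue in queues.values():
            if queue and len(released) < slots:
                released.append(queue.popleft())
    return released


queues = {
    "big": deque(f"big-{i}" for i in range(2000)),
    "small": deque(["small-0", "small-1"]),
}
batch = fair_dispatch(queues, slots=6)
workers = desired_workers(backlog=2000, jobs_per_worker=50, min_workers=1, max_workers=20)
```

With strict first-in-first-out, the small integrator's two jobs would sit behind all 2,000 of the large integrator's; with one job per integrator per pass they are both released within the first four slots.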

Page 15 · Study Roadmap

System Understanding & First Tasks

Being able to narrate these flows out loud is the real test that you understand RD.

Trace these end-to-end (on paper, then in the code)

  1. A scrape. /request-reviews → BUFFERED → queue manager → workflow → S3 → EVENT.DATA_RESULT. Name where the job row is created, who owns each status, and how RM finds out.
  2. A posting request. /submit-response for an AUTOMATED publisher (poster workflow) vs a MANUAL_QA one (spawns a QA task, workflow waits on a signal).
  3. A cookie capture — QA logs in → extension captures cookies → internal endpoint stores them encrypted → EVENT.COOKIE_UPDATED → poster replays them.
  4. A webhook retry — walk the 8-attempt schedule to dead-letter and the admin replay.
  5. A burst — 2,000 scrapes submitted at once: why they don't flood Temporal, how the gate + autoscaling drain them, how fairness protects a small integrator.
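When walking the webhook-retry flow, it can help to have a toy model of "retry, then dead-letter, then replay" in mind. In the sketch below only the attempt count of 8 comes from this guide; the delays (exponential backoff from 30 seconds) and all names are assumptions, so check the webhook component spec for the real schedule.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable

MAX_ATTEMPTS = 8  # from the guide; the delays below are illustrative only


def backoff_schedule(
    base_seconds: int = 30, factor: int = 2, attempts: int = MAX_ATTEMPTS
) -> list[int]:
    """Delay before each attempt; the first attempt is immediate."""
    return [0] + [base_seconds * factor**i for i in range(attempts - 1)]


@dataclass
class Delivery:
    event: str
    attempts: int = 0
    dead_lettered: bool = False


def deliver(delivery: Delivery, send: Callable[[str], bool]) -> bool:
    """Try until success or the attempt budget is spent, then dead-letter."""
    for _delay in backoff_schedule():
        delivery.attempts += 1  # a real worker would wait `_delay` seconds first
        if send(delivery.event):
            return True
    delivery.dead_lettered = True
    return False


def replay(delivery: Delivery, send: Callable[[str], bool]) -> bool:
    """Admin replay: reset the budget and run the schedule again."""
    delivery.attempts = 0
    delivery.dead_lettered = False
    return deliver(delivery, send)


schedule = backoff_schedule()
d = Delivery("EVENT.DATA_RESULT")
first_try_ok = deliver(d, send=lambda _event: False)  # endpoint always down
state_after_failure = (d.attempts, d.dead_lettered)
replay_ok = replay(d, send=lambda _event: True)       # endpoint recovered
```

Keeping the dead-lettered record, rather than discarding it, is what makes the admin replay possible after the integrator's endpoint comes back.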

Component specs to read, in order

Hands-on ramp (clear each with your mentor)

  1. Green baseline. Bring up docker-compose; hit /docs; alembic upgrade head; run pytest.
  2. Read one vertical slice. Pick a small admin endpoint; read controller → service → model → schema → test.
  3. Write a test. Add one for an existing endpoint with the AsyncClient fixture + a factory_boy factory.
  4. Ship a tiny spec-backed change. e.g. add a read-only field to a response schema (with the doc's blessing) + migration + test + changelog line — the full PR + doc-sync loop.
  5. Shadow a QA flow. Watch a real task move through a stream in the Admin UI.

Habits that make you effective here

Ground every claim

Cite path:line; if unsure, read or grep first. This repo takes anti-hallucination seriously.

Spec before code

Implement only against AGREED specs. Decision made in chat → update the doc first, then code.

Plan, then touch

Propose files + approach; get a nod; change only those files.

Respect boundaries

No scraper/poster code, no billing, no review content in DB, no copy-paste from old data-client-api/.

Keep docs in lockstep

Schema change → bump the component doc's version + changelog line in the same PR.

Verify before "done"

Run ruff, mypy, pytest. Never claim tests pass without running them.

Where to look when stuck

Question                        Where
"What is X / is X in scope?"    dev-docs/00-foundation/ + glossary
"How should I write this?"      dev-docs/03-engineering-standards/
"How does feature Y work?"      dev-docs/02-components/Y.md + its matching skill
"Who owns this table/flow?"     00-foundation/03-team-boundaries.md
"How do I run a migration?"     skill: run-migration