A deep technical walkthrough of the platform — the runtime topology, the stack, the design patterns, the async model, the Temporal orchestration, and the queue engine that scales it — plus a study roadmap to get productive fast.
What the system does, and the shape of it in one screen.
ReviewData (RD) is backend integrator infrastructure. It has no customers of its own — other platforms (called integrators, chiefly Reputation Management / RM) call its API to get four jobs done, and RD reports back over webhooks.
| Operation | Kind | How RD handles it |
|---|---|---|
| Scrape reviews | Long-running | Temporal workflow → reviews written to S3 → webhook. |
| Find publisher URLs | Long-running | Temporal workflow → URL written to DB → webhook. |
| Generate AI responses | In-process | A single OpenAI call in the API service — no Temporal. |
| Post responses | Long-running | Temporal workflow (or a human via QA) → webhook. |
request_id and task_id.
✗ No customer SPA · ✗ No billing (that's RM) · ✗ No review content in the DB (reviews live in S3; the DB stores URLs + metadata) · ✗ No scraper/poster implementation here (that's the Workflow Team, Page 4).
Read this top-to-bottom as a layered map. Each layer sits on the one below it; learn them in this order.
/docs. This is where every request enters.
Mapped/mapped_column/AsyncSession; postgresql+asyncpg://. Sync psycopg2 is forbidden.
.jl files, log archive) · CloudFront (CDN in front of S3).
contextvars correlation IDs · Sentry errors · Prometheus metrics · Argon2id/JWT/HMAC/Fernet security.
| Concern | Choice | Where used |
|---|---|---|
| Admin/QA login | JWT (access + refresh) | Admin/QA UI; also an extension-type JWT for the Chrome extension. |
| Integrator auth | API key — SHA-256 hashed at rest | In the request payload; looked up by prefix, then hash-verified. |
| Password hashing | Argon2id (argon2-cffi) | User passwords. |
| Webhook integrity | HMAC-SHA256 (X-RD-Signature) | Signs every outbound webhook body. |
| Secret-at-rest | Fernet symmetric encryption | Publisher credentials, cookies, OAuth tokens. |
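The "looked up by prefix, then hash-verified" row can be sketched with the standard library alone. This is a minimal illustration, not the repo's implementation: the `rd_` key format, the 8-character prefix length, and the in-memory `store` dict (standing in for the `api_keys` table) are all assumptions.

```python
import hashlib
import hmac
import secrets

PREFIX_LEN = 8  # assumption: number of leading characters used as the lookup prefix


def hash_key(raw_key: str) -> str:
    """SHA-256 hex digest of the raw key; only this digest is stored at rest."""
    return hashlib.sha256(raw_key.encode("utf-8")).hexdigest()


def issue_key(store: dict[str, str]) -> str:
    """Create a key, store prefix -> digest, and return the raw key exactly once."""
    raw_key = "rd_" + secrets.token_urlsafe(32)  # hypothetical key format
    store[raw_key[:PREFIX_LEN]] = hash_key(raw_key)
    return raw_key


def verify_key(store: dict[str, str], presented: str) -> bool:
    """Find the candidate by prefix, then compare digests in constant time."""
    stored_digest = store.get(presented[:PREFIX_LEN])
    if stored_digest is None:
        return False
    return hmac.compare_digest(stored_digest, hash_key(presented))
```

The prefix makes the lookup an indexed read instead of a table scan, and `hmac.compare_digest` avoids leaking how many leading characters of the digest matched.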
pytest + pytest-asyncio (auto), httpx.AsyncClient over in-process ASGI, factory_boy + Faker, coverage gates in CI.
ruff (lint + format), mypy (strict on app/), pre-commit hooks. pip + pinned requirements.txt.
Docker + docker-compose (dev), multi-stage builds for prod, Bitbucket Pipelines CI/CD, AWS ECR + ECS in prod.
Next.js 15 + React 18 + TS (App Router), MUI v6, Zustand (UI state), TanStack Query v5 (server state), RHF + Zod, Recharts.
requests → use httpx · psycopg2 in async paths → use asyncpg · Celery → replaced by Temporal · print() → use the logger · returning ORM models from handlers → always go through Pydantic · time.sleep → await asyncio.sleep.
The stack (Page 2) is what the code is made of. This page covers which processes run and how a request travels across all of them.
| Process | Owner | Responsibility |
|---|---|---|
| API service (Gunicorn/Uvicorn) | Platform | All HTTP: integrator REST, Admin/QA, extension endpoints. Starts workflows. Runs the in-process AI Response call and the background queue-manager loops. |
| Temporal server | Self-hosted | Durable state store + task-queue broker. Nothing business-specific lives here. |
| Scraper / poster / URL-finder workers | Workflow Team | Poll task queues, execute the actual publisher logic (Playwright, HTTP, captcha), write results. |
| webhook-delivery worker | Platform | The one worker Platform owns — runs the outbound webhook delivery workflow + retry schedule. |
| PostgreSQL | Shared | Single source of truth. Both teams read/write; only Platform writes migrations. |
| Redis | Shared | Ephemeral state: rate-limit counters, distributed locks, cookie/session store, scraper session tokens. |
Expensive clients are created once at startup in FastAPI's lifespan context manager and stored on app.state; dependencies pull from there. Creating a new DB engine / HTTP client / Redis pool per request would waste connections and defeat pooling.
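from contextlib import asynccontextmanager

import httpx
from fastapi import FastAPI
from redis.asyncio import Redis
from sqlalchemy.ext.asyncio import create_async_engine
from temporalio.client import Client

@asynccontextmanager
async def lifespan(app: FastAPI):
    app.state.db_engine = create_async_engine("postgresql+asyncpg://...")
    app.state.redis = Redis.from_url(...)  # one shared pool
    app.state.http = httpx.AsyncClient(timeout=10.0)  # reuse TCP conns
    app.state.temporal = await Client.connect(...)
    yield
    # shutdown: release pooled connections
    await app.state.http.aclose()
    await app.state.redis.aclose()
    await app.state.db_engine.dispose()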
task_id (202)
Steps 1–5 happen in milliseconds inside the API service. Steps 6–10 happen later, durably, in the worker fleet. The integrator bridges the gap by either polling a retrieve-* endpoint or waiting for the webhook.
The single most important organizational fact. Internalize the boundary before you write code.
The contract surface: REST API, RBAC, Admin/QA UI, webhook delivery, the DB schema + all migrations, every Pydantic model, the extension's auth, and the AI Response service.
The deep publisher logic: scrapers, posters, URL-finders — browser automation (Playwright), captcha solving, proxy rotation. Runs as Temporal workflows + activities.
FailureReason enum are Platform-owned. To fire a webhook, the Workflow Team calls a shared emit_webhook activity, never their own HTTP code.
task_id is generated by Platform at request time; the workflow is started with ID "{type}-{task_id}", so the Admin UI can deep-link into Temporal Web and the Workflow Team can look up job context by ID.
| Step | Owner |
|---|---|
| Validate request, create job row, set initial status, start workflow | Platform |
| Update status during execution; write final result (S3 URLs, errors) | Workflow Team |
| Fire the completion webhook (via the shared emit service) | Workflow Team |
scrape_result_batches.s3_url · Workflow Team calling FastAPI endpoints internally (they have direct DB access; endpoints are external-only) · either team rolling its own webhook HTTP · reading/writing the other team's tables without an explicit shared contract.
Model → Controller → Service. Strict, and non-negotiable. It is the rule you will be reviewed against most often.
A service is a plain class wrapping a session — unit-test it without booting FastAPI or hitting HTTP.
Thin controllers physically can't accumulate hidden logic; the ≤20-line rule forces it into services.
Every endpoint looks identical, so any file is readable on first open.
Services speak domain language (NotFoundError); one global handler turns that into HTTP + the envelope.
Controller — wiring only:
@router.post("/admin/users/{user_id}/disable", response_model=UserOut,
             dependencies=[require_permission("user.manage")])
async def disable_user(user_id: uuid.UUID,
                       session: AsyncSession = Depends(get_session),
                       auth: AuthContext = Depends(get_auth_context)) -> UserOut:
    return await UserService(session).disable_user(user_id, auth.user_id)
Service — logic + the transaction:
class UserService(BaseService):
    async def disable_user(self, user_id, actor_id) -> User:
        user = await self.db.get(User, user_id)
        if user is None:
            raise NotFoundError("User not found")  # domain exception
        user.is_active = False
        user.disabled_at = now_dt()
        user.disabled_by = actor_id
        await self.commit()  # transaction lives HERE
        await self.db.refresh(user)
        return user
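To make the "unit-test it without booting FastAPI" claim concrete, here is a dependency-free sketch of the same service running against an in-memory fake. `FakeSession`, the trimmed `BaseService`, and the dataclass `User` are stand-ins invented for this illustration; the real classes wrap SQLAlchemy's `AsyncSession` and ORM models.

```python
from __future__ import annotations

import asyncio
import uuid
from dataclasses import dataclass
from datetime import datetime, timezone


class NotFoundError(Exception):
    """Stand-in for the domain exception that services raise."""


@dataclass
class User:  # stand-in for the ORM model
    id: uuid.UUID
    is_active: bool = True
    disabled_at: datetime | None = None
    disabled_by: uuid.UUID | None = None


class FakeSession:
    """In-memory substitute for AsyncSession: only what this service touches."""

    def __init__(self, users: list[User]) -> None:
        self._users = {u.id: u for u in users}
        self.commits = 0

    async def get(self, model: type, pk: uuid.UUID) -> User | None:
        return self._users.get(pk)

    async def commit(self) -> None:
        self.commits += 1

    async def refresh(self, obj: object) -> None:
        return None  # nothing to reload in memory


class BaseService:  # hypothetical minimal base: one attribute, one helper
    def __init__(self, db: FakeSession) -> None:
        self.db = db

    async def commit(self) -> None:
        await self.db.commit()


class UserService(BaseService):
    async def disable_user(self, user_id: uuid.UUID, actor_id: uuid.UUID) -> User:
        user = await self.db.get(User, user_id)
        if user is None:
            raise NotFoundError("User not found")
        user.is_active = False
        user.disabled_at = datetime.now(timezone.utc)
        user.disabled_by = actor_id
        await self.commit()
        await self.db.refresh(user)
        return user
```

A test then needs nothing more than `asyncio.run(UserService(FakeSession([user])).disable_user(user.id, actor_id))` and a few assertions on the returned object and on `commits`.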
Controllers, services, and schemas each split into the same seven domains: core, admin, qa, integrator, posting, scraping, internal. Only models stay flat (app/models/*.py, one file per table). External I/O (HTTP, S3, Redis, Temporal) goes through dedicated client classes injected at the service layer.
HTTPException in services · ④ imports inside functions · ⑤ inline BaseModel in controllers. Also modern typing only (X | None, list[X]), and from __future__ import annotations first in every file.
The recurring shapes you'll see everywhere. Recognize them and new code writes itself.
Every service extends BaseService and is constructed inline as UserService(session) — effectively zero-cost (one attribute set), stateless. Per-service Depends() factories were deliberately rejected for now: they double indirection without earning testability. The service owns the transaction; the controller owns nothing but the HTTP call.
All errors subclass AppError (which carries an HTTP code, a stable error_code, and a message). Services raise them; a global handler renders the one true envelope. Clients branch on the machine-readable error_code, never on prose.
class AppError(Exception): code = 500; error_code = "INTERNAL_ERROR"
class NotFoundError(AppError): code = 404; error_code = "NOT_FOUND"
class ConflictError(AppError): code = 409; error_code = "CONFLICT"
# domain-specific extends the base:
class DuplicateLocationError(ConflictError): error_code = "DUPLICATE_LOCATION"
Request schema in, response schema out — and they are separate types. A request schema validates untrusted input; a response schema (with ConfigDict(from_attributes=True)) hydrates from an ORM row. Integrator responses inherit a StandardEnvelope (code + status) as their first two fields; pagination uses a generic Paginated[T] wrapper.
Because Temporal retries and integrators retry, every retryable operation must be safe to run twice. The patterns:
| Boundary | Idempotency key |
|---|---|
| Starting a workflow | Deterministic ID scrape-{task_id} → a duplicate start raises WorkflowAlreadyStartedError, caught & treated as success. |
| DB writes in an activity | INSERT ... ON CONFLICT DO NOTHING/UPDATE. |
| S3 writes | Deterministic key {request_id}/{task_id}-batch-{n}.jl — overwrite is safe (same content). |
| Outbound webhook | delivery_id — integrator dedupes on its side. |
| Integrator retry of a POST | De-dup on (api_key, foreign_key, endpoint) for a short window. |
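The keys in the table above are all plain string or tuple constructions. The sketch below shows three of them; the 60-second window, the in-memory dict, and the class and function names are assumptions for illustration (a shared store such as Redis would be needed across processes).

```python
import time


def workflow_id(kind: str, task_id: str) -> str:
    """Deterministic workflow ID: the same job always maps to the same ID."""
    return f"{kind}-{task_id}"


def batch_key(request_id: str, task_id: str, n: int) -> str:
    """Deterministic S3 key: re-uploading batch n overwrites the same object."""
    return f"{request_id}/{task_id}-batch-{n}.jl"


class RecentRequests:
    """Short-window de-dup on (api_key, foreign_key, endpoint)."""

    def __init__(self, window_seconds: float = 60.0, clock=time.monotonic) -> None:
        self._window = window_seconds  # assumption: the real window is not stated
        self._clock = clock
        self._seen: dict[tuple[str, str, str], tuple[float, str]] = {}

    def claim(self, api_key: str, foreign_key: str, endpoint: str, task_id: str) -> str:
        """Return the task_id that owns this request: the earlier one if the
        same key was seen inside the window, otherwise the new one."""
        now = self._clock()
        key = (api_key, foreign_key, endpoint)
        hit = self._seen.get(key)
        if hit is not None and now - hit[0] < self._window:
            return hit[1]  # retry of a recent POST: hand back the original task
        self._seen[key] = (now, task_id)
        return task_id
```

Injecting `clock` keeps the window testable without sleeping.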
contextvars
A middleware stamps request_id (and task_id/user_id when in scope) into a contextvar; the logger reads it automatically, so every log line across the async call stack is correlated without threading an argument everywhere.
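The mechanism can be shown in a few lines. This sketch uses the standard `logging` module rather than loguru so it stays dependency-free; `request_id_var`, `RequestIdFilter`, `ListHandler`, and `handle` are illustrative names, and `handle` plays the role of the middleware plus handler.

```python
import asyncio
import contextvars
import logging

# Default "-" marks log lines emitted outside any request.
request_id_var = contextvars.ContextVar("request_id", default="-")


class RequestIdFilter(logging.Filter):
    """Copy the current context's request_id onto every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.request_id = request_id_var.get()
        return True


class ListHandler(logging.Handler):
    """Collects formatted lines in memory so the demo can inspect them."""

    def __init__(self) -> None:
        super().__init__()
        self.lines: list[str] = []

    def emit(self, record: logging.LogRecord) -> None:
        self.lines.append(f"{record.request_id} {record.getMessage()}")


async def handle(request_id: str, log: logging.Logger) -> None:
    request_id_var.set(request_id)  # what the middleware would do
    await asyncio.sleep(0)          # yield so the two requests interleave
    log.info("done")                # note: no request_id argument


async def demo() -> list[str]:
    log = logging.getLogger("rd.demo")
    log.setLevel(logging.INFO)
    log.propagate = False
    handler = ListHandler()
    handler.addFilter(RequestIdFilter())
    log.addHandler(handler)
    try:
        # gather() wraps each coroutine in a task with its own context copy,
        # so the two set() calls do not overwrite each other.
        await asyncio.gather(handle("req-a", log), handle("req-b", log))
    finally:
        log.removeHandler(handler)
    return handler.lines
```

Each task sees only the value it set, which is why concurrent requests never pick up each other's IDs.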
Clients live on app.state (Page 3); the session is injected per-request via Depends(get_session) — one session per request, committed at the service boundary, never in the controller. Permission gates are declarative dependencies: dependencies=[require_permission("...")].
is_active=False or stamp deleted_at. History is a feature, not clutter.
The whole stack runs on one event loop per process. A single blocking call stalls every concurrent request, so treat sync I/O like a memory leak.
async def. If it calls an async def, it must await it. No exceptions on hot paths.
| Resource | Use | Never |
|---|---|---|
| Database | AsyncSession, session.scalar()/scalars(), postgresql+asyncpg:// | session.query() (1.x style), psycopg2 |
| HTTP | httpx.AsyncClient, reused from app.state | requests; a new client per call |
| Redis | redis.asyncio via RedisService, one shared pool | the sync redis API; a client built ad-hoc |
| S3 | aioboto3 (or boto3 inside asyncio.to_thread) | blocking boto3 on the loop |
| Sleep / locks | await asyncio.sleep(), asyncio.Lock | time.sleep(), threading.Lock |
# Fan out in parallel
results = await asyncio.gather(fetch_a(), fetch_b(), fetch_c())

# Bounded concurrency (don't stampede a dependency)
sem = asyncio.Semaphore(10)

async def bounded(x):
    async with sem:
        return await process(x)

results = await asyncio.gather(*(bounded(x) for x in items))

# Deadline (asyncio.timeout needs Python 3.11+)
async with asyncio.timeout(5):
    result = await slow_call()

# Unavoidable CPU-bound / sync code → offload to a thread
result = await asyncio.to_thread(blocking_parse, blob)
| Work | Mechanism |
|---|---|
| Short, non-critical (cache warm, fire-and-forget log) | asyncio.create_task(...) in the handler |
| Durable, retryable, multi-step (scrape, post, webhook) | Temporal workflow |
| Recurring poll loops (the queue managers) | Long-lived task started in lifespan |
| Anything you'd have used Celery for | Temporal. Celery is banned. |
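For the "long-lived task started in lifespan" row, a poll loop typically needs two properties: one failing tick must not end the loop, and shutdown must not wait out a full interval. The sketch below shows one way to get both with plain asyncio; `poll_loop`, `tick`, and the stop event are hypothetical names, not the repo's queue-manager code.

```python
import asyncio
from collections.abc import Awaitable, Callable


async def poll_loop(
    tick: Callable[[], Awaitable[None]],
    interval: float,
    stop: asyncio.Event,
) -> None:
    """Call tick() repeatedly until stop is set."""
    while not stop.is_set():
        try:
            await tick()
        except Exception:
            pass  # a real loop would log the exception and carry on
        try:
            # Sleep for `interval`, but wake immediately when stop is set.
            await asyncio.wait_for(stop.wait(), timeout=interval)
        except asyncio.TimeoutError:
            continue


async def demo() -> int:
    stop = asyncio.Event()
    ticks = 0

    async def tick() -> None:
        nonlocal ticks
        ticks += 1
        if ticks == 1:
            raise RuntimeError("transient failure")  # must not kill the loop
        if ticks == 3:
            stop.set()  # in the app, lifespan shutdown would set this

    task = asyncio.create_task(poll_loop(tick, interval=0.01, stop=stop))
    await task
    return ticks
```

In the application, the task would be created before `yield` in the lifespan function, and the event set (then the task awaited) after it.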
await (you now hold a coroutine, not the value) · a sync DB driver stalling the loop for all requests · a fresh httpx.AsyncClient per call (kills connection reuse) · asyncio.run() inside app code (the ASGI server owns the loop) · nest_asyncio (forbidden).
The database is the shared backbone between both teams. These are the entities and identifiers you'll meet daily.
external_location_id.
url_status: missing → pending_finder → pending_qa → active → invalid.
One row per publisher_key: the routing brain. Decides scraper engine (internal SAU vs external LDE), posting mode (AUTOMATED / MANUAL_QA / DISABLED), and auth type (OAUTH / CREDENTIALS / COOKIES).
A pooled email alias for the Business Manager (BM) flow: a publisher emails a "manager invite" that QA must accept. Aliases rotate out of the pool after a confirmation threshold.
Job tables (Platform creates the row; Workflow Team writes results): scrape_jobs, posting_jobs, url_finder, ai_response_job, competitive_report_job, insight_report_job.
Cross-cutting: inbound_request (every payload archived, keyed by request_id) · webhook endpoints/subscriptions/deliveries · users/roles/permissions (RBAC) · api_keys · qa_task/qa_task_event/qa_stream_config · jwt_token.
| Identifier | Generated by | Purpose |
|---|---|---|
| request_id | Platform (inbound) | Correlates one request → all downstream work + every log line. Never null. |
| task_id | Platform (job create) | The job ID; embedded in the Temporal workflow ID as {type}-{task_id}. |
| delivery_id | Platform (per attempt) | Changes on each webhook retry; the integrator's idempotency key. |
| foreign_key | Integrator | RM's own reference; RD only echoes it back. |
| external_location_id | Integrator | RM's business.id. |
| internal_business_id | Discovered while scraping | Publisher-side business ID (e.g. place_id); required to post. |
| internal_review_id | Discovered while scraping | Publisher-side review ID; sent to the publisher when posting a reply. |
Temporal is how RD runs work that must survive crashes, retries, and 24-hour human waits. This is the engine room.
Deterministic orchestration code. Its state is durably replayed after any crash. No direct I/O, no randomness, no wall-clock — use workflow.now(), workflow.uuid4(), workflow.random().
The side-effecting unit — HTTP calls, DB writes, S3 uploads, Playwright. Retried automatically by Temporal, so it must be idempotent.
{type}-{task_id} (scrape-…, post-…, urlfinder-…, webhook-{delivery_root_id}). Deterministic & human-readable → the Admin UI computes the Temporal Web URL straight from a job row.
WorkflowAlreadyStartedError and treat as success.
upload_review_batch_to_s3, post_response_via_playwright, emit_webhook; never process/run/execute.
When a workflow needs a human (a URL entered, cookies captured), it waits durably, for up to 24h, on a signal. This is how the QA board plugs into a running workflow.
@workflow.signal
async def provide_url(self, payload: UrlProvidedPayload):
    self._provided_url = payload.url

# elsewhere in the workflow body, always with a timeout
# (raises asyncio.TimeoutError if no signal arrives in time):
await workflow.wait_condition(
    lambda: self._provided_url is not None,
    timeout=timedelta(hours=24),
)
job = await self.repo.create(...)
await self.session.commit()  # commit FIRST, so workers can see the job row
await self.temporal.start_workflow(
    "ScrapeWorkflow", ScrapeWorkflowInput(scrape_job_id=job.id, ...),
    id=f"scrape-{job.id}",
    task_queue=route_to_task_queue(payload.publisher_key),  # Page 10
)
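The "duplicate start is treated as success" rule and the Admin UI deep link can be sketched without a Temporal server. Everything here is a stand-in: `FakeTemporalClient` imitates only the start-rejection behaviour, the local `WorkflowAlreadyStartedError` mirrors the SDK exception of the same name, and the `/namespaces/{namespace}/workflows/{id}` path, base URL, and namespace are assumptions about the Temporal Web deployment.

```python
import asyncio
from urllib.parse import quote


class WorkflowAlreadyStartedError(Exception):
    """Local stand-in for temporalio.exceptions.WorkflowAlreadyStartedError."""


class FakeTemporalClient:
    """Rejects a second start for an ID that is already running."""

    def __init__(self) -> None:
        self.started: list[str] = []

    async def start_workflow(self, name: str, arg: object, *, id: str, task_queue: str) -> str:
        if id in self.started:
            raise WorkflowAlreadyStartedError(id)
        self.started.append(id)
        return id


async def start_scrape(client: FakeTemporalClient, task_id: str, task_queue: str) -> str:
    """Idempotent start: a duplicate is a success, not an error."""
    workflow_id = f"scrape-{task_id}"
    try:
        await client.start_workflow(
            "ScrapeWorkflow", {"scrape_job_id": task_id},
            id=workflow_id, task_queue=task_queue,
        )
    except WorkflowAlreadyStartedError:
        pass  # an earlier attempt already started it
    return workflow_id


def temporal_web_url(base_url: str, namespace: str, workflow_id: str) -> str:
    """Build a deep link from nothing but the job row's type and task_id."""
    return f"{base_url.rstrip('/')}/namespaces/{quote(namespace)}/workflows/{quote(workflow_id)}"
```

Because the ID is derived from the job row, a retried request or a crashed-and-restarted dispatcher converges on the same workflow instead of launching a second one.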
Temporal routes work by task queue. Different work has wildly different cost profiles, so it gets different queues and different concurrency limits.
| Queue | Purpose | Concurrency profile |
|---|---|---|
| scraping-internal | Internal SAU scraper workflows | normal |
| scraping-lde | LDE-routed scrapes | normal |
| scrape-<publisher> | Per-publisher scrape workflows + their HTTP activities | normal (per queue) |
| scraping-browser | Activity-routing target: only the browser-backed activities hop here | bounded by the browser-api session envelope (~12) |
| posting | Response posting (most common) | normal |
| posting-browser | Cookie/credential posting (Playwright) | 1 per worker (Playwright is heavy) |
| url-finder | URL discovery | normal |
| webhook-delivery | Outbound webhook delivery + retry | high, horizontally scaled |
| maintenance | Cron-style: cookie health, alias rotation, archival | low |
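One way to picture the scrape rows of this table is as two small routing decisions: which queue the workflow runs on, and which queue each activity runs on. The sketch below is hypothetical throughout. `PublisherRoute`, the `dedicated_queue` flag, and both function names are invented for illustration; the source does not show how the real `route_to_task_queue` decides.

```python
from dataclasses import dataclass
from enum import Enum


class ScraperEngine(Enum):
    INTERNAL_SAU = "internal_sau"
    EXTERNAL_LDE = "external_lde"


@dataclass(frozen=True)
class PublisherRoute:
    publisher_key: str
    engine: ScraperEngine
    dedicated_queue: bool = False  # hypothetical: publisher has its own scrape-<publisher> queue


def pick_scrape_queue(route: PublisherRoute) -> str:
    """Choose the task queue for the scrape workflow itself."""
    if route.dedicated_queue:
        return f"scrape-{route.publisher_key}"
    if route.engine is ScraperEngine.EXTERNAL_LDE:
        return "scraping-lde"
    return "scraping-internal"


def pick_activity_queue(workflow_queue: str, needs_browser: bool) -> str:
    """Cheap HTTP activities stay on the workflow's queue; browser-backed
    activities hop to the shared, tightly capped browser pool."""
    return "scraping-browser" if needs_browser else workflow_queue
```

Keeping the two decisions separate is what lets HTTP work scale wide while browser sessions stay under one global cap.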
scrape-<publisher> and its cheap HTTP activities stay there, but its expensive browser-backed activities hop onto the shared scraping-browser pool. Why: one browser-api container handles only ~15–20 concurrent Chrome sessions, and you don't want that scarce resource throttling cheap HTTP. The browser pool is a global cap (~10–12); the HTTP work scales wide independently.
DATA_RESULT:REVIEW_DATA vs :LDE_DATA); ③ the scrape "queue" backlog of BUFFERED job rows in Postgres that the queue manager dispatches (Page 11). They are three different things.
Roughly 30 task-queue workers run inside worker processes on ECS (EC2, c6i.2xlarge, bin-packed ~3 tasks/machine). The fleet scales 1 → 12 tasks across up to 4 machines. Crucially, it does not scale on CPU/memory (HTTP-light scraping sits at ~13% CPU); it scales on queue backlog, which is the whole subject of the next page.
RD's most sophisticated Platform-owned subsystem. It decides how many workflows run, which jobs get a slot, and how big the worker fleet grows. Learn it — it's the interview-grade part of the codebase.
Scrape submits are cheap and instant, but scrapes are slow and resource-bounded. If every submit started a live Temporal workflow immediately, a burst of thousands would flood one event loop, starve the pollers, and freeze the fleet (this actually happened in prod). So jobs land in Postgres as BUFFERED rows, and a single background queue manager meters them into Temporal.
PER_TASK_HTTP_CONCURRENCY (50) × running_worker_count, plus a global browser cap of 10. So 1 worker → gate 50; 12 workers → gate 600: each task always sees a safe per-task load regardless of fleet size. Worker count comes from Temporal's WorkerHeartbeat registry (presence-based, so a saturated worker is still counted; an activity-poller count under-reported the busy fleet by ~45%).
OutstandingScrapeJobs = BUFFERED + PENDING_SCRAPE + QUEUED + IN_PROGRESS: all non-terminal work, not just waiting depth (scaling in while heavy workflows were still running once stranded ~150 of them).
ROW_NUMBER() OVER (PARTITION BY publisher_key) query, so a 1,000-job flood of one publisher can't head-of-line-block a single job of another.
scrape_priority (lower value = higher priority; production = 100, test/staging raised to 1000). Dispatch is priority-first: the whole production tier drains before any test-tier job gets a slot. Test load still counts toward autoscaling, so it is a first-class citizen for capacity; it just yields dispatch order.
Insight (C32) and Competitive (C34) reports reuse the exact pattern: a BUFFERED admission state + a ReportQueueManagerService, gating at the process-start boundary (not submit), so even the "600 scrapes finish at once → 600 resume hooks fire" thundering herd gets metered. Two independent lanes (insight / competitive, ~40 concurrent each, since they run on separate worker fleets), with a shared OpenAI backpressure belt that honors 429/Retry-After above both.
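The admission gate, priority-first ordering, and per-publisher fair share described above can be combined into one in-memory sketch. It is an approximation under stated assumptions: `BufferedJob`, `created_seq` (standing in for creation time), the `in_flight` argument, and the ordering inside each publisher partition are all invented here; the real dispatcher does this in SQL, and the global browser cap is left out.

```python
from __future__ import annotations

from collections import defaultdict
from dataclasses import dataclass

PER_TASK_HTTP_CONCURRENCY = 50
NON_TERMINAL = ("BUFFERED", "PENDING_SCRAPE", "QUEUED", "IN_PROGRESS")


@dataclass(frozen=True)
class BufferedJob:
    task_id: str
    publisher_key: str
    scrape_priority: int = 100  # lower value = higher priority
    created_seq: int = 0        # stand-in for created_at ordering


def admission_gate(running_worker_count: int) -> int:
    """Total live workflows allowed: a fixed load per worker task."""
    return PER_TASK_HTTP_CONCURRENCY * running_worker_count


def outstanding_scrape_jobs(counts_by_status: dict[str, int]) -> int:
    """Autoscaling signal: all non-terminal work, not just waiting depth."""
    return sum(counts_by_status.get(status, 0) for status in NON_TERMINAL)


def pick_for_dispatch(
    buffered: list[BufferedJob], in_flight: int, running_worker_count: int
) -> list[BufferedJob]:
    """Pick which BUFFERED jobs get the free slots this tick."""
    free_slots = max(0, admission_gate(running_worker_count) - in_flight)

    # In-memory equivalent of ROW_NUMBER() OVER (PARTITION BY publisher_key):
    # each job's rank within its own publisher.
    seen_per_publisher: dict[str, int] = defaultdict(int)
    rank: dict[str, int] = {}
    for job in sorted(buffered, key=lambda j: (j.scrape_priority, j.created_seq)):
        seen_per_publisher[job.publisher_key] += 1
        rank[job.task_id] = seen_per_publisher[job.publisher_key]

    # Priority tier first, then round-robin across publishers by rank.
    ordered = sorted(
        buffered, key=lambda j: (j.scrape_priority, rank[j.task_id], j.created_seq)
    )
    return ordered[:free_slots]
```

With five buffered jobs for one publisher, one later job for another, and two free slots, the second slot goes to the other publisher's job rather than to the flood's second job, which is the head-of-line protection the text describes.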
The external contract: how integrators submit work and how RD reports back.
API key in the payload; every response wraps the standard envelope. Async endpoints return 202 Accepted + a task_id.
| Endpoint | Kind | Does |
|---|---|---|
| POST /request-reviews | Async → task_id | Kick off a scrape. |
| POST /retrieve-task | Sync, paginated | Read scrape results. |
| POST /get-publisher-info | Sync | Look up publisher/account info. |
| POST /submit-response | Async → task_id | Post a reply to a review. |
| POST /retrieve-posting-data | Sync, paginated | Read posting results. |
| POST /ai-response | Sync/async (per config) | Generate an AI reply (in-process OpenAI). |
| POST /url-finder/per-business · /file-based | Async → task_id | URL discovery, single or bulk. |
POST /request-reviews; Platform authenticates the key and archives the payload in inbound_request (request_id).
scrape_jobs row is created (BUFFERED) with a fresh task_id; the endpoint returns 202 + task_id.
scrape-<task_id> on the right queue.
PublisherConfiguration, and writes reviews to S3 as .jl files (DB stores only CloudFront URLs + metadata).
EVENT.DATA_RESULT via the shared emit_webhook activity.
/retrieve-task).
| Event | Fires when |
|---|---|
| EVENT.URL_UPDATED | A publisher URL is resolved (auto or QA). |
| EVENT.DATA_RESULT | A scrape completed. |
| EVENT.CREDENTIAL_UPDATED · EVENT.COOKIE_UPDATED | QA added credentials / captured cookies. |
| EVENT.RESPONSE_SUBMISSION | A posting attempt finished (success or failure). |
| EVENT.PUBLISHER_DISCONNECTED | An account got disconnected during posting. |
Webhook delivery isn't a fire-and-forget HTTP call — it's a Temporal workflow (Platform-owned, on the webhook-delivery queue) that loops an 8-attempt retry schedule (workflow.sleep() between attempts) over activities http_post_webhook → record_attempt → mark_dead_letter. After 8 failures it's dead-lettered and surfaced in the Admin UI for manual replay. Each attempt carries a fresh delivery_id; the body is HMAC-signed.
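The signing and retry shape described above can be sketched with the standard library. Several details are assumptions, since the text does not specify them: the hex encoding of the `X-RD-Signature` value, carrying `delivery_id` inside the JSON body, and the `send` callback that stands in for the `http_post_webhook` activity. The plain `for` loop replaces the durable workflow loop and omits the sleeps.

```python
import hashlib
import hmac
import json
import uuid

MAX_ATTEMPTS = 8


def sign_body(secret: bytes, body: bytes) -> str:
    """HMAC-SHA256 over the exact bytes that go on the wire (hex is an assumption)."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify_signature(secret: bytes, body: bytes, header_value: str) -> bool:
    """What an integrator would run on receipt; constant-time comparison."""
    return hmac.compare_digest(sign_body(secret, body), header_value)


def deliver(send, secret: bytes, event: dict) -> tuple[str, int]:
    """Try up to MAX_ATTEMPTS times; returns (outcome, attempts_used)."""
    for attempt in range(1, MAX_ATTEMPTS + 1):
        payload = dict(event, delivery_id=str(uuid.uuid4()))  # fresh per attempt
        body = json.dumps(payload, sort_keys=True).encode("utf-8")
        headers = {"X-RD-Signature": sign_body(secret, body)}
        if send(body, headers):
            return "delivered", attempt
        # the real workflow records the attempt and sleeps durably here
    return "dead_letter", MAX_ATTEMPTS
```

Signing the serialized bytes, not the dict, matters: the receiver must verify against the raw request body before parsing it.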
The human-in-the-loop system, and how failures are shaped into a clean contract.
AUTOMATED — the poster workflow logs in and posts (Yelp, Booking). MANUAL_QA — routed to a human via a QA stream (OAuth publishers like Google/Facebook). DISABLED — off.
OAUTH — token-based. CREDENTIALS — Playwright login, session captured. COOKIES — QA pre-captures session cookies via the extension; the poster replays them.
Tasks have lifecycle events (qa_task_event), are claimed/released by operators, scoped per user via stream configs, governed by qa.task.* permissions. An SLA sweeper flags stale tasks; reconcilers keep the board consistent with the underlying jobs. Completing a QA action signals the waiting Temporal workflow (Page 9) — that's the bridge between the board and the engine.
Services raise typed AppError subclasses; a global handler renders exactly this — raw FastAPI/Pydantic/SQLAlchemy errors never reach a client, and stack traces go to logs, not responses.
{
"code": 409,
"status": "failed",
"error": "A location with this external_location_id already exists.",
"details": { "external_location_id": ["already exists"] },
"meta": {
"request_id": "9f9d2b5a-...",
"timestamp": "2026-04-28T14:23:11Z",
"error_code": "DUPLICATE_LOCATION"
}
}
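A global handler that produces this envelope reduces to one pure function. The sketch below is framework-free and makes some assumptions: the `message` and `details` constructor arguments, the `render_error` name, and the fallback message for unexpected exceptions are illustrative, and `"status": "failed"` is taken from the example above rather than from a stated rule.

```python
from __future__ import annotations

from datetime import datetime, timezone


class AppError(Exception):
    code = 500
    error_code = "INTERNAL_ERROR"

    def __init__(self, message: str, details: dict | None = None) -> None:
        super().__init__(message)
        self.message = message
        self.details = details or {}


class ConflictError(AppError):
    code = 409
    error_code = "CONFLICT"


class DuplicateLocationError(ConflictError):
    error_code = "DUPLICATE_LOCATION"


def render_error(exc: Exception, request_id: str, now: datetime | None = None) -> dict:
    """Turn any exception into the one envelope clients are allowed to see."""
    if not isinstance(exc, AppError):
        # Unknown failure: never leak raw exception text to the client.
        exc = AppError("Internal server error")
    timestamp = (now or datetime.now(timezone.utc)).strftime("%Y-%m-%dT%H:%M:%SZ")
    return {
        "code": exc.code,
        "status": "failed",
        "error": exc.message,
        "details": exc.details,
        "meta": {
            "request_id": request_id,
            "timestamp": timestamp,
            "error_code": exc.error_code,
        },
    }
```

In the application this function would sit behind the framework's exception-handler hook, with the traceback logged separately.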
foreign_key).
warning; 5xx → exception (with traceback). Expected validation failures are never logged at error level.
error_code is SCREAMING_SNAKE_CASE, stable across versions, kept in a constants module so it's discoverable; clients branch on it.
Structured JSON logs (loguru) carry request_id/task_id/user_id on every line via contextvars; logs ship to S3 (daily partitions) + Datadog. Sentry catches exceptions. Note that a bare logger.error() does not reach Sentry; you must call capture_exception(...). Prometheus exposes metrics.
A week-by-week path. Each item says why it matters here and what "done" looks like.
async/await & the event loop (Page 7). Coroutines, gather, Semaphore, timeout, why blocking calls are banned.
list[X], X | None, Annotated; from __future__ import annotations. RD runs mypy strict.
/submit-response returns a task_id instead of the final result.
Depends, response_model, declarative permission gates. Skim the live /docs.
backend/03-fastapi-conventions.md.
BaseModel, Field, validators, BaseSettings, from_attributes. Understand why request and response schemas are separate types.
backend/04-pydantic-schemas.md.
Mapped, AsyncSession, select()/scalars(), relationships, migrations.
alembic upgrade head. Ref: backend/05-sqlalchemy-and-alembic.md; skills: alembic-migrations, run-migration.
{type}-{task_id} convention.
backend/06-temporal-conventions.md; skill: temporal-workflows.
01-architecture/01-mcs-architecture.md until you spot a violation instantly.
01-architecture/06-worker-queue-driven-autoscaling.md. It's dense but it's the crown jewel.
logger.error misses Sentry.
backend/07-error-handling.md, 08-logging.md, 09-testing.md.
Being able to narrate these flows out loud is the real test that you understand RD.
/request-reviews → BUFFERED → queue manager → workflow → S3 → EVENT.DATA_RESULT. Name where the job row is created, who owns each status, and how RM finds out.
/submit-response for an AUTOMATED publisher (poster workflow) vs a MANUAL_QA one (spawns a QA task, workflow waits on a signal).
EVENT.COOKIE_UPDATED → poster replays them.
02-components/01-rbac-and-users.md · 02-api-keys.md: auth.
04-locations-sites-accounts.md: the core hierarchy.
06-…request-reviews · 09-…submit-response: the two key async endpoints.
13-webhooks.md + 14-failure-reasons-enum.md: the outbound side.
16-manual-tasks-qa-board.md · 23-posting-workflows.md · 27-cookie-health.md: QA + posting internals.
/docs; alembic upgrade head; run pytest.
AsyncClient fixture + a factory_boy factory.
Cite path:line; if unsure, read or grep first. This repo takes anti-hallucination seriously.
Implement only against AGREED specs. Decision made in chat → update the doc first, then code.
Propose files + approach; get a nod; change only those files.
No scraper/poster code, no billing, no review content in DB, no copy-paste from old data-client-api/.
Schema change → bump the component doc's version + changelog line in the same PR.
Run ruff, mypy, pytest. Never claim tests pass without running them.
| Question | Where |
|---|---|
| "What is X / is X in scope?" | dev-docs/00-foundation/ + glossary |
| "How should I write this?" | dev-docs/03-engineering-standards/ |
| "How does feature Y work?" | dev-docs/02-components/Y.md + its matching skill |
| "Who owns this table/flow?" | 00-foundation/03-team-boundaries.md |
| "How do I run a migration?" | skill: run-migration |