Architecture Decisions
ADR 1: Adoption of Next.js as Frontend Framework
Context
The project needs a modern frontend framework with server-side rendering, good performance, SEO support, scalability, and simple deployment. The team evaluated React, Angular, Vue, and Next.js.
Decision
The frontend will use Next.js because it provides built-in SSR, SSG, routing, and API routes that align well with the project requirements.
Status
Accepted
Consequences
Positive: better performance, stronger SEO support, a mature ecosystem, and simpler routing and full-stack integration
Negative: a learning curve for developers new to Next.js and a more opinionated project structure
Neutral: continued dependency on the React ecosystem
ADR 2: Adoption of Python with FastAPI for Backend Services
Context
The project needs a backend that is performant, maintainable, scalable, type-safe, and easy to document. The team evaluated Node.js, Java, and Python frameworks including Django and Flask.
Decision
The backend will use Python with FastAPI because it offers high performance, async support, Pydantic validation, automatic OpenAPI documentation, and a clean developer experience.
Status
Accepted
Consequences
Positive: fast async request handling, generated API documentation, strong validation, and rapid development in Python
Negative: a smaller ecosystem than some enterprise frameworks and a need to understand async patterns
Neutral: dependency on the Python runtime and ecosystem
ADR 3: Usage of pnpm as Package Manager
Context
The project needs a package manager that is performant, maintainable, scalable, and easy to use. The team evaluated npm, yarn, and pnpm.
Decision
The project will use pnpm as the package manager for frontend workflows and dependency management.
Status
Accepted
Consequences
Positive: pnpm is fast and efficient with a smaller disk footprint
Negative: it has a smaller ecosystem footprint than npm
Neutral: adoption continues to grow and the tooling is actively maintained
ADR 4: Database Technology
Context
The project needs a persistent database for users, sessions, and match data. The main decision is NoSQL versus relational storage. The team evaluated MongoDB for NoSQL and MariaDB/PostgreSQL for relational options.
Decision
The project will use a relational database. The domain model is strongly structured (users, matches, scores, and session relations), and consistency is important for game state and ranking data. PostgreSQL was selected because it is reliable, well-known by the team, and provides strong SQL capabilities for future analytics and reporting needs.
Status
Accepted
Consequences
Positive: strong data integrity guarantees, rich querying, and mature operational tooling
Negative: schema migrations and relational modeling add design and maintenance overhead
Neutral: development workflows include SQL and migration tooling as standard practice
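The integrity guarantees mentioned above can be illustrated with a small sketch. It uses Python's built-in sqlite3 module as a stand-in for PostgreSQL so that it runs without a database server. The table names, column names, and sample rows are assumptions made for illustration, not the project's actual schema.

```python
import sqlite3

# In-memory SQLite stands in for PostgreSQL; all names below are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite needs this; PostgreSQL enforces FKs by default
conn.executescript(
    """
    CREATE TABLE users (
        id INTEGER PRIMARY KEY,
        username TEXT NOT NULL UNIQUE
    );
    CREATE TABLE matches (
        id INTEGER PRIMARY KEY,
        winner_id INTEGER NOT NULL REFERENCES users(id)
    );
    CREATE TABLE scores (
        match_id INTEGER NOT NULL REFERENCES matches(id),
        user_id INTEGER NOT NULL REFERENCES users(id),
        points INTEGER NOT NULL CHECK (points >= 0),
        PRIMARY KEY (match_id, user_id)
    );
    """
)
conn.executemany("INSERT INTO users (id, username) VALUES (?, ?)", [(1, "alice"), (2, "bob")])
conn.execute("INSERT INTO matches (id, winner_id) VALUES (1, 1)")
conn.executemany("INSERT INTO scores VALUES (?, ?, ?)", [(1, 1, 30), (1, 2, 20)])


def leaderboard(connection):
    """Return (username, total points) rows, highest total first."""
    return connection.execute(
        """
        SELECT u.username, SUM(s.points) AS total
        FROM scores AS s JOIN users AS u ON u.id = s.user_id
        GROUP BY u.id
        ORDER BY total DESC
        """
    ).fetchall()


def insert_orphan_score(connection):
    """Try to store a score for a user that does not exist; return True on success."""
    try:
        connection.execute("INSERT INTO scores VALUES (1, 999, 10)")
    except sqlite3.IntegrityError:
        return False  # the foreign key constraint rejected the row
    return True
```

The database rejects the orphaned score itself, so ranking data stays consistent even if application code has a bug.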
ADR 5: ORM vs Writing SQL Queries
Context
The backend needs a consistent and maintainable way to access relational data. The team evaluated two approaches: writing raw SQL queries for all operations or using an ORM with typed schemas and validation.
Decision
The project will use an ORM-first approach with SQLAlchemy for data access and Pydantic for request/response and domain validation. Raw SQL can still be used for performance-critical or highly specialized queries when needed.
Status
Accepted
Consequences
Positive: improves developer productivity, consistency, and readability in CRUD and relationship-heavy operations
Negative: adds ORM abstraction overhead and requires careful query tuning to avoid performance pitfalls
Neutral: team workflows include SQLAlchemy models and Pydantic schemas as standard backend patterns
ADR 6: WebSocket vs API Polling for Game Communication
Context
The project needs a communication pattern for the multiplayer game mode. Gameplay events such as match state updates, countdowns, and answer submissions must be delivered with low latency and in near real time.
Decision
The project will use WebSockets as the primary communication channel for battle mode runtime events. HTTP API endpoints remain in use for non-realtime operations such as authentication, setup, and historical data retrieval.
Status
Accepted
Consequences
Positive: supports bidirectional low-latency communication and improves realtime user experience in matches
Negative: introduces additional complexity for connection lifecycle, reconnect handling, scaling, and monitoring
Neutral: requires team familiarity with event-driven patterns while existing REST endpoints continue to be used where appropriate
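The event-driven pattern mentioned above can be sketched independently of the transport. The example below routes one decoded text frame to a handler by event type. The envelope shape (`type` plus `payload`), the event names, and the helper names `on` and `dispatch` are assumptions for illustration. In a real backend, a function like this would presumably sit inside the WebSocket receive loop.

```python
import json
from typing import Any, Callable, Dict

Handler = Callable[[Dict[str, Any]], Dict[str, Any]]

# Registry mapping an event type to the function that handles it.
HANDLERS: Dict[str, Handler] = {}


def on(event_type: str) -> Callable[[Handler], Handler]:
    """Decorator that registers a handler for one event type."""
    def register(handler: Handler) -> Handler:
        HANDLERS[event_type] = handler
        return handler
    return register


@on("answer_submitted")  # hypothetical event name
def handle_answer(payload: Dict[str, Any]) -> Dict[str, Any]:
    """Acknowledge a submitted answer by echoing its question id."""
    return {"type": "answer_received", "payload": {"question_id": payload["question_id"]}}


def dispatch(raw_message: str) -> Dict[str, Any]:
    """Decode one incoming text frame and route it to its registered handler."""
    try:
        message = json.loads(raw_message)
        handler = HANDLERS[message["type"]]
    except (json.JSONDecodeError, KeyError, TypeError):
        # Malformed JSON, a missing "type", or an unregistered event type.
        return {"type": "error", "payload": {"reason": "unknown_or_malformed_event"}}
    return handler(message.get("payload", {}))
```

Keeping the routing separate from the socket code makes the handlers testable without opening a connection.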
ADR 7: Initial Google Login
Context
The project needed an authentication mechanism for registered players. Google OAuth was considered because it provides a familiar login flow and avoids building password handling directly into the application.
Decision
Use Google login as the initial identity-provider approach for player authentication.
Status
Superseded by ADR 8
Consequences
Positive: users with an existing Google account would have had a familiar login experience
Positive: the application team would not have needed to operate a separate identity provider
Negative: automated E2E tests would have depended on an external commercial login flow that is difficult to control in Playwright
Negative: registration would have required a Google account and would not have supported the project’s minimal username/password registration goal
Neutral: this decision is kept as historical context; the implemented system uses Keycloak as defined in ADR 8
ADR 8: Authentication with Keycloak
Context
The project initially used Google OAuth for authentication. Three concerns drove a change: testability (Google's login flow is difficult to automate in Playwright), dependency on an external commercial service, and the inability to support custom registration flows with minimal required fields.
Decision
Replace Google OAuth with a self-hosted Keycloak instance (Authorization Code + PKCE flow via keycloak-js). Keycloak runs as a Docker service and is automatically configured via a realm import on startup. The backend verifies Keycloak access tokens using JWKS public-key validation (PyJWT + PyJWKClient). Application sessions remain backend-managed: session records are stored in PostgreSQL and referenced through HttpOnly cookies. Only the identity provider changes.
Status
Accepted
Consequences
Positive: login and registration are fully controllable and mockable in E2E tests without external service dependencies.
Positive: registration requires only a username and password; no Google account is needed.
Positive: Keycloak is open-source and self-hosted, removing the dependency on Google API policies.
Negative: adds a Keycloak container to the deployment stack, which must be kept healthy and configured.
Neutral: the backend continues to manage application sessions and authorization for protected routes.
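The PKCE part of the Authorization Code flow can be summarized in a few lines. In this project keycloak-js performs these steps in the browser, so the Python sketch below only illustrates what the flow computes according to RFC 7636 (S256 method). The function names are assumptions.

```python
import base64
import hashlib
import secrets


def make_code_verifier(num_bytes: int = 32) -> str:
    """Random URL-safe verifier; 32 random bytes encode to 43 characters, the RFC 7636 minimum."""
    raw = secrets.token_bytes(num_bytes)
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def make_code_challenge(code_verifier: str) -> str:
    """S256 challenge: base64url(SHA-256(verifier)) with the padding removed."""
    digest = hashlib.sha256(code_verifier.encode("ascii")).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
```

The client sends the challenge with the authorization request and the verifier with the token request, so an intercepted authorization code cannot be redeemed without the verifier.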
ADR 10: Process-Local Matchmaking and Battle State
Context
Ranked battles require low-latency state transitions, timers, player connections, answer handling, surrender handling, and result publication. The current system persists users, sessions, rankings, question cache entries, and completed match results, but the active matchmaking queue and battle state are runtime state owned by the backend process.
Decision
Keep active queue, match, timer, and battle state in memory inside the backend services. Persist durable outcomes such as match results and ranking changes to PostgreSQL after the battle is completed or forfeited.
Status
Accepted
Consequences
Positive: the implementation stays simple and fast for the current single-backend deployment model
Positive: battle logic can keep direct references to active WebSocket connections and timers without a distributed coordination layer
Negative: active matches are lost when the backend process restarts
Negative: horizontal scaling would require sticky routing, shared state, or a dedicated coordination mechanism
Negative: E2E battle tests need controlled sequential execution because the queue and battle state are shared through the backend process
Neutral: completed match results and ranking updates remain persistent in PostgreSQL
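The split between in-memory runtime state and persisted outcomes can be sketched as follows. This is a simplified illustration under stated assumptions: the class names, the two-player pairing rule, and the `persist` callback (standing in for a PostgreSQL write) are hypothetical and not taken from the project's code.

```python
from collections import deque
from dataclasses import dataclass, field
from itertools import count
from typing import Callable, Deque, Dict, Optional, Tuple


@dataclass
class Match:
    match_id: int
    players: Tuple[str, str]
    finished: bool = False


@dataclass
class MatchmakingState:
    """Process-local queue and active matches; everything here is lost on restart."""
    queue: Deque[str] = field(default_factory=deque)
    active: Dict[int, Match] = field(default_factory=dict)
    _ids: count = field(default_factory=lambda: count(1))

    def enqueue(self, player_id: str) -> Optional[Match]:
        """Add a player to the queue; return a new Match once two players are waiting."""
        if player_id in self.queue:
            return None  # already waiting, ignore the duplicate request
        self.queue.append(player_id)
        if len(self.queue) < 2:
            return None
        match = Match(next(self._ids), (self.queue.popleft(), self.queue.popleft()))
        self.active[match.match_id] = match
        return match

    def finish(self, match_id: int, persist: Callable[[Match], None]) -> Match:
        """Drop the match from memory and hand the durable outcome to `persist`."""
        match = self.active.pop(match_id)
        match.finished = True
        persist(match)  # hypothetical hook, e.g. a function that writes the result to PostgreSQL
        return match
```

Because a single `MatchmakingState` instance is shared by every request in the process, tests that enqueue players interfere with each other unless they run sequentially, which matches the E2E constraint noted above.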
ADR 11: Circuit Breaker for the External Trivia API
Context
The external trivia provider (/v2/questions) is reached through
TriviaApiClient, which already protects each call with a request timeout,
retry with exponential backoff, and a local question cache as fallback. These
guards handle short, transient hiccups, but they do not handle a prolonged
upstream outage well: every incoming request still pays the full timeout and
retry budget before failing, which wastes time and connections and keeps
hammering a service that is already down. The canonical resilience pattern for
this situation is a circuit breaker that detects a sustained failure streak and
fails fast instead of retrying on every request.
Decision
Wrap the external trivia call with a circuit breaker using the pybreaker
library. The breaker is layered outside the existing retry loop, so one fully
failed fetch_questions call (after its internal retries) counts as a single
breaker failure. After a configurable number of consecutive failures the breaker
opens and short-circuits further upstream calls, raising
TriviaUpstreamUnavailableError immediately. After a configurable cooldown the
breaker moves to half-open and closes again on the next successful call.
Non-outage errors (non-retryable HTTP responses and invalid payloads) are
excluded from breaker accounting because they do not indicate provider
unavailability. The thresholds are configured via TRIVIA_BREAKER_FAIL_MAX and
TRIVIA_BREAKER_RESET_TIMEOUT, analogous to the other TRIVIA_* settings.
While the breaker is open, the existing degradation strategy still applies: the trivia service always serves matching questions from the local cache first. It short-circuits (fast 503 for HTTP callers, controlled abort for battle setup) only when the cache cannot satisfy the request, and it now does so without the prior per-request timeout and retry penalty.
Status
Accepted
Consequences
Positive: during a sustained upstream outage the backend fails fast instead of exhausting timeout and retry budgets on every request, reducing latency and load on an already failing provider
Positive: thresholds are environment-configurable and consistent with the existing TRIVIA_* settings convention
Positive: automatic recovery via the half-open state requires no manual intervention once the provider returns
Negative: adds the pybreaker dependency and a shared breaker state object that must be sized correctly to avoid premature opening
Neutral: the breaker state is process-local, matching the single-backend deployment model described in ADR 10
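The closed, open, and half-open transitions described in this ADR can be sketched with a small hand-rolled breaker. This is a stdlib-only stand-in written for illustration; it is not pybreaker's API and not the project's implementation. The class name, the `excluded` parameter, the injectable `clock`, and `UpstreamUnavailable` (standing in for TriviaUpstreamUnavailableError) are assumptions.

```python
import time


class UpstreamUnavailable(Exception):
    """Raised while the breaker is open; stands in for TriviaUpstreamUnavailableError."""


class CircuitBreaker:
    def __init__(self, fail_max, reset_timeout, excluded=(), clock=time.monotonic):
        self.fail_max = fail_max            # consecutive failures before opening
        self.reset_timeout = reset_timeout  # cooldown in seconds before half-open
        self.excluded = tuple(excluded)     # non-outage errors that never count
        self.clock = clock                  # injectable so tests can control time
        self.failures = 0
        self.opened_at = None

    @property
    def state(self):
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.reset_timeout:
            return "half-open"
        return "open"

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            raise UpstreamUnavailable("circuit open, failing fast")
        try:
            result = func(*args, **kwargs)
        except self.excluded:
            raise  # e.g. invalid payloads: propagate without touching the counter
        except Exception:
            self.failures += 1
            # A failed half-open trial or reaching the threshold (re)opens the circuit.
            if self.opened_at is not None or self.failures >= self.fail_max:
                self.opened_at = self.clock()
            raise
        # Any success closes the circuit and resets the failure streak.
        self.failures = 0
        self.opened_at = None
        return result
```

In this sketch, `fail_max` and `reset_timeout` play the roles of TRIVIA_BREAKER_FAIL_MAX and TRIVIA_BREAKER_RESET_TIMEOUT. The wrapped function would be the whole fetch including its internal retries, so one exhausted retry loop counts as a single failure.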