Service Level Objectives and Agreements

Warning: these objectives are currently in draft and have not yet been agreed or signed off.

Service Level Indicators (SLIs) are the metrics used to measure the level of service provided to end users (e.g., availability, latency, throughput). Service Level Objectives (SLOs) are the target levels of service, as measured by SLIs. Monitoring and alerting will be performed using Prometheus and Grafana.

Name	Indicator	Objective
Aggregated Availability	The percentage of valid requests through the internal API Gateway that are served successfully, i.e. 100 * (requests served successfully, with a 2XX status code / total valid requests, with a 2XX or 5XX status code)	>99% of requests to POST /events result in a 200 status (not a 500 error)
Endpoint response time	Response time for all API requests, including success and failure	99th centile of duration in a 24 hour window < 1,000ms
End to end latency	Time from an event being published by a supplier to it being consumed by an acquirer	99th centile of duration in a 24 hour window < 1h
Service latency	Time between an event being ingested and it being available to an acquirer	99th centile of duration in a 24 hour window < 5m
Ingestion availability	This will be defined per data source. For example with HMPO, is the file available within the agreed window?	-
Target Capacity	The service is able to meet the above SLOs when load does not exceed the target number of events in a given time window	6,000 events in a 24 hour window
Data Subject Requests	All requests for supporting Access Requests from relevant Data Controllers will be handled within a given time frame	7 days
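
As a rough illustration of how the availability and response time objectives in the table could be evaluated over a 24 hour window, the sketch below works from raw request counts and durations. The function names are hypothetical, the nearest-rank percentile method is an assumption (a percentile estimated from Prometheus histogram buckets may give a slightly different value), and treating a window with no valid traffic as 100% available is also an assumption.

```python
from typing import Sequence


def availability_percent(successful_2xx: int, failed_5xx: int) -> float:
    """Aggregated availability: 100 * 2XX / (2XX + 5XX)."""
    valid = successful_2xx + failed_5xx
    if valid == 0:
        # Assumption: a window with no valid traffic counts as fully available.
        return 100.0
    return 100.0 * successful_2xx / valid


def percentile_99(durations_ms: Sequence[float]) -> float:
    """Nearest-rank 99th percentile of the observed durations."""
    if not durations_ms:
        raise ValueError("no samples in window")
    ordered = sorted(durations_ms)
    # Integer ceiling of 0.99 * n, giving a 1-based rank.
    rank = (99 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def meets_objectives(
    successful_2xx: int, failed_5xx: int, durations_ms: Sequence[float]
) -> bool:
    """Check the draft objectives: availability > 99% and p99 < 1,000 ms."""
    return (
        availability_percent(successful_2xx, failed_5xx) > 99.0
        and percentile_99(durations_ms) < 1000.0
    )
```

For example, 995 successful and 5 failed requests give an availability of 99.5%, which meets the >99% objective; the same check fails at exactly 99% because the objective is a strict inequality.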

Service Capacity

As noted under Target Capacity above, the service is designed for a capacity of approximately twice the expected peak load, and is tested to that level. In non-production environments, mock data generation averages 5 events per minute (7,200 in 24 hours), but these events are released gradually through the day, so this is not representative of a peak load. If required, additional peak load testing data can be arranged.
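
The arithmetic behind these figures can be checked with a short sketch. The function names are hypothetical and the calculation assumes events are spread evenly across the day, which is the very reason the mock data does not represent a peak.

```python
MINUTES_PER_DAY = 24 * 60  # 1,440 minutes in a 24 hour window


def daily_total(events_per_minute: float) -> float:
    """Events in 24 hours at a steady per-minute rate."""
    return events_per_minute * MINUTES_PER_DAY


def average_rate_per_minute(daily_events: float) -> float:
    """Average per-minute rate implied by a 24 hour total."""
    return daily_events / MINUTES_PER_DAY


mock_daily = daily_total(5)                   # mock data generation: 7,200 events
target_rate = average_rate_per_minute(6000)   # target capacity: about 4.17 per minute
```

On average, the mock data rate (5 events per minute) is slightly above the rate implied by the 6,000 event target, but an even spread says nothing about how the service behaves under a short burst.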

Service Availability (RPO, RTO)

In addition to the Aggregated Availability and other SLOs above, we also have targets for the Recovery Point Objective (RPO) and Recovery Time Objective (RTO).

During the private beta, we’re aiming for:

RTO: restoration attempted during business hours, which may take up to 72 hours (for example, over a weekend). Further details are given in the incident levels in incident management.
RPO: loss of up to one hour of metadata/audit data, but acquirers will receive all data from upstream suppliers, potentially with a delay.