October 8th, 2024

Weekly Product Updates

New

  • Enhanced SLA details on portal and admin dashboard:

    • No Data indicator: You'll now see a "No data for SLO" message when there's no data to compute. This addition helps distinguish between processing issues and genuine lack of data, improving your ability to troubleshoot and understand SLO status.

    • Top-level SLO metadata: The SLA portal now displays comprehensive SLO information including name, description, target, and compliance period details.

    • Daily summary statuses: We've introduced SLO-level status for compliance windows. Instead of just seeing an overall SLA status, you can now track the performance of individual SLOs within their specific compliance windows.

  • Debugging aid: To assist with troubleshooting, we've added a last_update_at field to SLAs (you can access this via the inspect feature). This timestamp indicates when our compute engine last processed the SLA, helping you distinguish between compute issues and missing data scenarios. Remember, our system recalculates SLO and SLA metrics each time a new event is indexed, so this field can be crucial in identifying the source of any discrepancies.

  • Customer → Organization updates: We've made some changes to make our platform more inclusive and flexible:

    • The "vendors" table has been renamed to "organizations". This change reflects our recognition that both producers and consumers of SLAs can benefit from slaOS, moving beyond a vendor-specific approach.

    • In line with this change, we've updated the ingestion URLs to use organization_id instead of customer_id as the unique identifier. This shift allows for a more versatile representation of entities using our platform.
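
As a rough illustration of the debugging aid above, the sketch below compares an SLA's last_update_at with the time of the most recent event you sent. It is not slaOS code: the function name, the last_event_at input (taken from your own ingestion logs), and the 5-minute tolerance are all assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def diagnose_staleness(
    last_update_at: Optional[datetime],  # value read via the inspect feature
    last_event_at: Optional[datetime],   # hypothetical: newest event you ingested
    tolerance: timedelta = timedelta(minutes=5),  # assumed acceptable compute lag
) -> str:
    """Classify a stale-looking SLA as a data problem or a compute problem."""
    if last_event_at is None:
        # Nothing was sent, so there is nothing for the engine to compute.
        return "missing data"
    if last_update_at is None or last_update_at + tolerance < last_event_at:
        # Events arrived after the engine last processed the SLA.
        return "compute issue"
    # The engine has processed the SLA since the newest event arrived.
    return "compute up to date"
```

If the result is "compute up to date" and the portal still shows "No data for SLO", the likelier explanation is that no matching events exist for that window.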

Updates

  • Interval SLO description: We've refined the autofill description for interval SLOs to more accurately reflect the calculation method.

    • Before: "Calculation every 24 hours over a calendar period of 1 month"

    • After: "Interval based calculation every 24 hours over a calendar period of 1 month"

  • Storage optimization: In an effort to improve system efficiency, we now automatically delete performance documents when an SLA is deleted. This change optimizes storage usage and helps maintain a cleaner, more manageable database.

  • Computation frequency: Based on user feedback about slow computation times, we now run our computation processes more frequently. SLA and SLO data should reflect newly indexed events sooner as a result.

  • Compute process enhancements: We've made several behind-the-scenes improvements to our compute process:

    • Enhanced overall maintainability and performance, resulting in more efficient data processing.

    • Simplified and optimized the queries used to identify which SLAs/SLOs need (re)computation. This change includes the application of new indexes to improve query speed.

    • Implemented smarter filtering to avoid unnecessary compute on SLAs/SLOs that haven't received new data or are considered expired (end date several days in the past). This optimization helps allocate system resources more effectively.
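
The filtering rule in the last bullet can be pictured roughly as follows. This is a minimal sketch under stated assumptions, not the actual compute engine: the SlaState fields, the needs_compute name, and the 3-day grace period (standing in for "several days") are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class SlaState:
    # All field names are illustrative, not the slaOS schema.
    end_date: Optional[datetime]          # None means the SLA has no end date
    last_event_at: Optional[datetime]     # newest event indexed for this SLA
    last_computed_at: Optional[datetime]  # when the SLA was last computed


def needs_compute(
    sla: SlaState,
    now: datetime,
    expiry_grace: timedelta = timedelta(days=3),  # assumed value for "several days"
) -> bool:
    """Return True only if (re)computing this SLA could change its result."""
    if sla.end_date is not None and sla.end_date + expiry_grace < now:
        return False  # expired: end date is well in the past
    if sla.last_event_at is None:
        return False  # no data has ever arrived
    if sla.last_computed_at is None:
        return True   # data exists but was never computed
    return sla.last_event_at > sla.last_computed_at  # new data since last run
```

Checking expiry first keeps the cheapest test at the top, so most skipped SLAs are rejected without comparing event timestamps.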

Fixes

  • Backend improvements:

    • SLA identifier in performance documents: We've added the SLA identifier to both the ID and the data of each performance document. This fix resolves an issue where overlapping SLAs that share the same SLO couldn't be distinguished.

    • Expired SLA handling: Our system now better handles expired SLAs, preventing unnecessary computation attempts on outdated agreements.

    • OpenSearch query optimization: We've addressed the TooManyBucketsException error by increasing the bucket limit. This fix prevents the 503 errors that occurred when creating aggregations with a high number of buckets, ensuring more reliable data retrieval and analysis.

    • Breach count handling: We've shortened the breach cache period and no longer cache a count of 0 breaches, so breach details now refresh accurately in the interface.

  • Frontend improvements:

    • White screen of death: We've resolved an issue where creating a new Objective (SLO) was resulting in a white screen after selecting an indicator.

    • Service Target for aggregated SLIs: Event-based SLOs built on aggregated SLIs (i.e., indicators that use an aggregation such as avg, sum, or count) can only have a Service Target of 0% or 100%, and configuring any other value resulted in an error. The client now restricts the selection to 0% or 100%, so an invalid target can no longer be chosen.

  • Calculation corrections:

    • Error budget calculation: Fixed a bug where the error budget was incorrectly showing as 0 even when daily SLO compliance over the entire compliance period was above the target.

    • Compliant missing intervals: We've corrected the logic for handling compliant missing intervals. Previously, these were incorrectly shown as "missing" or "no data". Now, they're properly treated as compliant.

    • Cumulative calculation: Resolved an off-by-one error in cumulative calculations on error budgets. We now correctly index at the beginning of the day instead of the end, ensuring precise day-to-day tracking of your error budget consumption.
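
To make the start-of-day indexing in the cumulative calculation fix concrete, here is a small sketch. The changelog does not give the actual formula, so the function name, the inputs, and the "total minus running consumption" model are assumptions chosen only to show the off-by-one.

```python
from itertools import accumulate
from typing import List


def remaining_budget_at_day_start(
    total_budget: float,             # hypothetical: budget for the whole period
    daily_consumption: List[float],  # hypothetical: budget consumed on each day
) -> List[float]:
    """Remaining error budget at the beginning of each day.

    Day i only reflects consumption from days 0..i-1. An end-of-day index
    would also subtract day i's own consumption, which shifts every value
    by one day.
    """
    # accumulate(..., initial=0.0) yields n+1 running totals; dropping the
    # last one leaves the total consumed *before* each day starts.
    consumed_before = list(accumulate(daily_consumption, initial=0.0))[:-1]
    return [total_budget - consumed for consumed in consumed_before]
```

With a budget of 10 and daily consumption of 2, 3, and 1, the start-of-day series is 10, 8, 5, so the first day of the period shows the full budget rather than a budget already reduced by that day's errors.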