docs: group cloud storage components under architecture/cloud-storage #34
Merged
7 commits
- c71ae57 docs: group cloud storage components under architecture/cloud-storage (senolcolak)
- 9b53667 fix: correct dead links after cloud-storage folder restructure (senolcolak)
- 9201f7b docs: expand cloud-storage section with Overview, Liquid-Ceph, and Ob… (senolcolak)
- 10da036 docs: move Prysm into Observability & Audit subsection (senolcolak)
- e45c001 fix: add language tag to fenced code block (MD040) (senolcolak)
- 5e94a35 Potential fix for pull request finding (senolcolak)
- c2e5c73 Potential fix for pull request finding (senolcolak)
File renamed without changes.
File renamed without changes.
File renamed without changes.
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,74 @@ | ||
| --- | ||
| title: Cloud Storage | ||
| --- | ||
|
|
||
| # Cloud Storage | ||
|
|
||
| CobaltCore's cloud storage layer is built on [Ceph](./ceph.md), a distributed storage system that delivers object, block, and file storage in a single unified platform. The surrounding components handle lifecycle automation, data replication, high-availability quorum, observability, and liquid storage allocation — each with a focused responsibility. | ||
|
|
||
| ## Architecture | ||
|
|
||
| The storage stack is organized into three layers, with observability listed as a fourth layer in the component table below: | ||
|
|
||
| **Foundation** — Ceph provides the core distributed storage engine. All other components either operate it, extend it, or observe it. | ||
|
|
||
| **Operations** — [Rook](./rook.md) runs as a Kubernetes operator and manages the full lifecycle of Ceph daemons (monitors, managers, OSDs, MDS, RGW) as containerized workloads. [Arbiter](./arbiter.md) extends quorum into stretched cluster topologies by deploying external Ceph monitors that Rook does not manage directly. | ||
|
|
||
| **Data Services** — [Chorus](./chorus.md) provides zero-downtime data replication and migration between object storage systems (S3 and Swift). [Liquid-Ceph](./liquid-ceph.md) enables dynamic, on-demand storage allocation across the cluster. | ||
|
|
||
| ## Components | ||
|
|
||
| | Component | Layer | Role | | ||
| |-----------|-------|------| | ||
| | [Ceph](./ceph.md) | Foundation | Distributed storage engine — block (RBD), file (CephFS), object (RGW) | | ||
| | [Rook](./rook.md) | Operations | Kubernetes operator for Ceph lifecycle management | | ||
| | [Arbiter](./arbiter.md) | Operations | External Ceph monitors for quorum in stretched clusters | | ||
| | [Chorus](./chorus.md) | Data Services | Zero-downtime object storage replication and migration | | ||
| | [Liquid-Ceph](./liquid-ceph.md) | Data Services | Dynamic storage allocation across the Ceph cluster | | ||
| | [Observability & Audit](./observability/) | Observability | Metrics, dashboards, alerting, and audit — Prometheus, Perses, Prysm | | ||
|
|
||
| ## Storage Interfaces | ||
|
|
||
| Ceph exposes three storage interfaces that CobaltCore services consume: | ||
|
|
||
| - **RBD (RADOS Block Device)** — thin-provisioned, resizable block volumes used by virtual machines and databases. Striped across OSDs for parallel I/O and backed by RADOS snapshots and replication. | ||
| - **CephFS** — POSIX-compliant distributed filesystem. Metadata is managed by a dedicated MDS cluster; data is striped across OSDs. Supports snapshots, quotas, and multiple active MDS daemons for horizontal metadata scaling. | ||
| - **RGW (RADOS Gateway)** — S3 and Swift-compatible object storage gateway. Supports multi-tenancy, versioning, lifecycle policies, server-side encryption, and multi-site active-active replication. | ||
|
|
||
| ## Data Flow | ||
|
|
||
| ```text | ||
| Applications / VMs | ||
| │ | ||
| ┌───────┴────────────────────┐ | ||
| │ RBD │ CephFS │ RGW │ ← Ceph interfaces | ||
| └───────┴────────────────────┘ | ||
| │ | ||
| RADOS (Reliable Autonomic Distributed Object Store) | ||
| │ | ||
| OSDs across cluster nodes | ||
| │ | ||
| ┌────┴─────┐ | ||
| │ Rook │ ← manages daemon lifecycle via Kubernetes CRDs | ||
| └──────────┘ | ||
| │ | ||
| ┌────┴──────┐ ┌─────────┐ ┌────────────┐ | ||
| │ Arbiter │ │ Chorus │ │ Liquid-Ceph│ | ||
| └───────────┘ └─────────┘ └────────────┘ | ||
| (quorum) (replication) (allocation) | ||
| │ | ||
| ┌────┴──────────────────────────┐ | ||
| │ Observability & Audit │ | ||
| │ Prometheus · Perses · Prysm │ | ||
| └───────────────────────────────┘ | ||
| ``` | ||
|
|
||
| ## High Availability | ||
|
|
||
| Ceph achieves high availability (HA) through monitor quorum (typically 3 or 5 monitors), OSD replication or erasure coding, and MDS standby daemons. In stretched deployments that span two sites, [Arbiter](./arbiter.md) deploys an additional monitor at a third, tiebreaker site so that quorum is maintained even if one full site goes offline. | ||
|
|
||
| ## See Also | ||
|
|
||
| - [Observability & Audit](./observability/) — Prometheus metrics, Perses dashboards, and Prysm CLI for the storage stack | ||
| - [Ceph upstream architecture docs](https://docs.ceph.com/en/latest/architecture/) | ||
| - [Rook documentation](https://rook.io/docs/rook/latest-release/Getting-Started/intro/) |
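The High Availability paragraph in this file relies on monitor quorum, which requires a strict majority of monitors to be reachable. The sketch below illustrates why a tiebreaker site matters in a two-site stretched layout. The function names, site names, and monitor counts are assumptions made for illustration and are not taken from the CobaltCore or Arbiter configuration.

```python
def has_quorum(total_monitors: int, reachable_monitors: int) -> bool:
    """Return True when a strict majority of monitors is reachable."""
    if not 0 <= reachable_monitors <= total_monitors:
        raise ValueError("reachable_monitors must be between 0 and total_monitors")
    return reachable_monitors > total_monitors // 2


def survives_site_loss(monitors_per_site: dict[str, int], lost_site: str) -> bool:
    """Check whether quorum holds after every monitor at lost_site goes offline."""
    total = sum(monitors_per_site.values())
    reachable = total - monitors_per_site[lost_site]
    return has_quorum(total, reachable)


# Hypothetical layouts: site names and monitor counts are illustrative only.
two_sites = {"site-a": 2, "site-b": 2}
with_tiebreaker = {"site-a": 2, "site-b": 2, "tiebreaker": 1}

print(survives_site_loss(two_sites, "site-a"))        # 2 of 4 monitors reachable
print(survives_site_loss(with_tiebreaker, "site-a"))  # 3 of 5 monitors reachable
```

With an even split across two sites, losing either site leaves exactly half of the monitors, which is not a majority. Adding one monitor at a separate tiebreaker site makes the total odd, so the surviving data site plus the tiebreaker still forms a majority.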
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,16 @@ | ||
| --- | ||
| title: Liquid-Ceph | ||
| --- | ||
|
|
||
| # Liquid-Ceph | ||
|
|
||
| Liquid-Ceph enables dynamic, on-demand storage allocation across the CobaltCore Ceph cluster. It abstracts the complexity of pool and quota management, allowing workloads to claim storage capacity fluidly without manual pre-provisioning steps. | ||
|
|
||
| ::: info | ||
| Detailed documentation for Liquid-Ceph is in progress. This page will be updated as the component matures. | ||
| ::: | ||
|
|
||
| ## See Also | ||
|
|
||
| - [Ceph](./ceph.md) — the underlying distributed storage engine | ||
| - [Rook](./rook.md) — Kubernetes operator managing Ceph lifecycle |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,37 @@ | ||
| --- | ||
| title: Observability & Audit | ||
| --- | ||
|
|
||
| # Observability & Audit Overview | ||
|
|
||
| CobaltCore monitors the cloud storage stack through a combination of Prometheus-based metrics collection, Perses dashboards, and the Prysm observability CLI. Together they provide real-time visibility into Ceph cluster health, OSD performance, RGW throughput, storage capacity trends, and audit compliance. | ||
|
|
||
| ## Stack | ||
|
|
||
| | Component | Role | | ||
| |-----------|------| | ||
| | [Prometheus](./prometheus.md) | Scrapes and stores time-series metrics from Ceph, Rook, and RGW exporters | | ||
| | [Perses](./perses.md) | Dashboard platform for visualizing storage metrics (alert rules are defined as Prometheus rules) | | ||
| | [Prysm](./prysm.md) | CLI-based observability tool for Ceph clusters and RGW — real-time monitoring, SMART disk health, log compliance | | ||
|
|
||
| ## Key Metrics | ||
|
|
||
| The following signal categories are covered by the observability stack: | ||
|
|
||
| - **Cluster health** — overall Ceph health status, OSD up/in counts, monitor quorum state | ||
| - **Capacity** — raw and usable capacity, per-pool usage, growth rate projections | ||
| - **Performance** — OSD read/write latency, IOPS, throughput per interface (RBD, CephFS, RGW) | ||
| - **RGW** — request rates, error rates, bandwidth per bucket and user | ||
| - **Replication** — Chorus replication lag, sync success/failure rates | ||
| - **Availability** — Arbiter monitor reachability, MDS active/standby state | ||
| - **Audit** — log compliance analysis and access audit via Prysm consumers | ||
|
|
||
| ## Alerting | ||
|
|
||
| Alerts are defined as Prometheus rules and surfaced through the CobaltCore alerting pipeline. Critical conditions include OSD usage above the near-full threshold (85%), a degraded cluster state, loss of monitor quorum, and spikes in the RGW error rate. | ||
|
|
||
| ## See Also | ||
|
|
||
| - [Prometheus](./prometheus.md) | ||
| - [Perses](./perses.md) | ||
| - [Prysm](./prysm.md) | ||
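The Capacity entry under Key Metrics mentions growth rate projections. One simple way to read such a projection is a linear estimate of the days remaining before usage crosses a threshold. The sketch below assumes constant daily growth and reuses the 85% near-full figure from the Alerting section as a default. The function name and the sample numbers are hypothetical, and the real dashboards may use a different model.

```python
from typing import Optional


def days_until_threshold(
    used_bytes: float,
    capacity_bytes: float,
    growth_bytes_per_day: float,
    threshold: float = 0.85,  # assumption: reuse the 85% near-full figure
) -> Optional[float]:
    """Linear projection of days left before usage reaches threshold * capacity.

    Returns 0.0 if the threshold is already reached, and None if usage is
    flat or shrinking (no crossing under a linear model).
    """
    if capacity_bytes <= 0:
        raise ValueError("capacity_bytes must be positive")
    limit = capacity_bytes * threshold
    if used_bytes >= limit:
        return 0.0
    if growth_bytes_per_day <= 0:
        return None
    return (limit - used_bytes) / growth_bytes_per_day


TIB = 2**40  # bytes per tebibyte

# Hypothetical pool: 100 TiB capacity, 60 TiB used, growing 0.5 TiB per day.
remaining = days_until_threshold(60 * TIB, 100 * TIB, 0.5 * TIB)
print(f"about {remaining:.0f} days until the threshold is reached")
```

A linear model is the simplest choice and ignores seasonality and rebalancing, so it is best treated as a rough planning signal rather than a forecast.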
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,30 @@ | ||
| --- | ||
| title: Perses | ||
| --- | ||
|
|
||
| # Perses | ||
|
|
||
| Perses is the dashboard platform used in CobaltCore to visualize cloud storage metrics collected by [Prometheus](./prometheus.md). It provides pre-built dashboards for Ceph cluster health, OSD performance, RGW traffic, and capacity planning. | ||
|
|
||
| ## Dashboards | ||
|
|
||
| | Dashboard | Purpose | | ||
| |-----------|---------| | ||
| | Ceph Cluster Overview | Health status, OSD counts, monitor quorum, capacity summary | | ||
| | OSD Performance | Per-OSD read/write latency, IOPS, throughput | | ||
| | Pool Usage | Capacity and object counts per Ceph pool | | ||
| | RGW Traffic | Request rate, error rate, bandwidth per bucket and user | | ||
| | Replication Status | Chorus sync lag and success/failure rates | | ||
|
|
||
| ## Dashboard-as-Code | ||
|
|
||
| Dashboards are managed as code using the Perses CUE SDK and deployed via CI. This ensures dashboards are version-controlled alongside the rest of the CobaltCore configuration. | ||
|
|
||
| ::: info | ||
| Dashboard definitions and deployment configuration are in progress. | ||
| ::: | ||
|
|
||
| ## See Also | ||
|
|
||
| - [Prometheus](./prometheus.md) — metrics source for all dashboards | ||
| - [Observability Overview](./index.md) |
37 changes: 37 additions & 0 deletions
docs/architecture/cloud-storage/observability/prometheus.md
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,37 @@ | ||
| --- | ||
| title: Prometheus | ||
| --- | ||
|
|
||
| # Prometheus | ||
|
|
||
| Prometheus collects and stores time-series metrics from the CobaltCore cloud storage stack. It scrapes exporters provided by Ceph, Rook, and the RADOS Gateway, making storage metrics available for alerting and dashboard queries. | ||
|
|
||
| ## Exporters | ||
|
|
||
| | Exporter | Source | Metrics | | ||
| |----------|--------|---------| | ||
| | `ceph-exporter` | Ceph daemons | OSD stats, pool usage, cluster health, latency histograms | | ||
| | `rook-ceph-mgr` | Rook Ceph manager | Operator status, daemon lifecycle events | | ||
| | `radosgw-exporter` | RGW | Request rates, error rates, per-user and per-bucket bandwidth | | ||
|
|
||
| ## Retention and Storage | ||
|
|
||
| Metrics are retained according to the cluster-wide Prometheus retention policy. Long-term storage is handled by the remote-write pipeline configured in the CobaltCore monitoring stack. | ||
|
|
||
| ## Alert Rules | ||
|
|
||
| Storage-specific alert rules are maintained alongside the other CobaltCore alerting rules. Key rules include: | ||
|
|
||
| - `CephHealthWarning` / `CephHealthError` — cluster health degradation | ||
| - `CephOSDNearFull` — OSD usage exceeding 85% | ||
| - `CephMonQuorumLost` — loss of monitor quorum | ||
| - `RGWHighErrorRate` — elevated 5xx rate on the gateway | ||
|
|
||
| ::: info | ||
| Detailed rule definitions and Prometheus configuration are in progress. | ||
| ::: | ||
|
|
||
| ## See Also | ||
|
|
||
| - [Perses](./perses.md) — dashboard platform consuming these metrics | ||
| - [Observability Overview](./index.md) |
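Since the rule definitions in this file are still marked as in progress, the following sketch only illustrates the conditions that `CephOSDNearFull` and `RGWHighErrorRate` describe, evaluated over sample data in plain Python. It is not the Prometheus rule syntax and not the project's actual rules. The data class, the function names, and the 5% error-rate threshold are assumptions; only the 85% OSD figure comes from the rule list above.

```python
from dataclasses import dataclass

NEAR_FULL_PERCENT = 85       # from the CephOSDNearFull description
ERROR_RATE_THRESHOLD = 0.05  # hypothetical: the source gives no 5xx threshold


@dataclass(frozen=True)
class OsdUsage:
    osd_id: int
    used_bytes: int
    total_bytes: int


def near_full_osds(osds: list[OsdUsage], percent: int = NEAR_FULL_PERCENT) -> list[int]:
    """Return IDs of OSDs whose usage strictly exceeds the given percentage."""
    # Integer arithmetic avoids floating-point edge cases at exactly 85%.
    return [
        o.osd_id
        for o in osds
        if o.total_bytes > 0 and o.used_bytes * 100 > o.total_bytes * percent
    ]


def rgw_error_rate_high(
    total_requests: int,
    server_errors: int,
    threshold: float = ERROR_RATE_THRESHOLD,
) -> bool:
    """Return True when the share of 5xx responses exceeds the threshold."""
    if total_requests == 0:
        return False
    return server_errors / total_requests > threshold


# Hypothetical sample data.
sample = [OsdUsage(0, 90, 100), OsdUsage(1, 85, 100), OsdUsage(2, 10, 100)]
print(near_full_osds(sample))          # only OSD 0 exceeds 85%
print(rgw_error_rate_high(1000, 80))   # 8% of requests returned 5xx
```

The comparison is strict, matching the wording "exceeding 85%", so an OSD at exactly 85% does not fire in this sketch.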