New data record schema (Parquet) to replace .nc files #101

Description

@lkstrp

The current .nc files are simply the standard IO format used by PyPSA. For this app they have many disadvantages and were not designed for this use case. PyPSA currently cannot lazy-load at all: even for a small plot, the full .nc file has to be loaded into memory. The schema is very PyPSA-specific, is not cleanly structured, and cannot easily be extended. ...

Proposed new data record / schema:

  • The schema is derived from PyPSA but owned by the app, so we can adapt it as needed.
  • A data record is a single instance of that schema, populated with data.
  • Processing multiple records (queries for analytics, but it could be anything) works simply by pointing to N records, provided the query and processing generalise over multiple dimensions. The same query definition should give me the energy balance of a single data record as well as of multiple data records combined.
  • A single record does not represent multiple scenarios. However, the same paros logic can run on multiple records, which allows multiple scenarios to be processed.

DataRecordA/
  ├── manifest.json          # version, attribute catalog — immutable, user may not change anything here
  ├── snapshots.parquet
  ├── periods.parquet
  ├── components.parquet
  ├── scenarios.parquet      # for stochastic optimization, not workflow scenarios
  ├── data/
  │   └── <attr>.parquet     # ComponentType | component | snapshot | scenario | period | value
  └── results/
      └── <attr>.parquet     # ComponentType | component | snapshot | scenario | period | value
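
A minimal sketch of how the layout above could be bootstrapped and checked, using only the Python standard library. The manifest keys (`version`, `columns`, `attributes`), the helper names, and the attribute names `p_set` and `p` are assumptions for illustration; the issue does not define the manifest contents beyond "version, attribute catalog". No Parquet files are written here, only the skeleton and a completeness check.

```python
import json
import tempfile
from pathlib import Path

# Top-level index tables from the layout above.
INDEX_TABLES = ("snapshots", "periods", "components", "scenarios")
# Long-format columns shared by every data/ and results/ attribute table.
ATTR_COLUMNS = ("ComponentType", "component", "snapshot", "scenario", "period", "value")


def init_record(root: Path, version: str, attributes: dict[str, list[str]]) -> Path:
    """Create an empty record skeleton. The manifest keys are hypothetical."""
    (root / "data").mkdir(parents=True, exist_ok=True)
    (root / "results").mkdir(exist_ok=True)
    manifest = {"version": version, "columns": list(ATTR_COLUMNS), "attributes": attributes}
    (root / "manifest.json").write_text(json.dumps(manifest, indent=2))
    return root


def missing_files(root: Path) -> list[str]:
    """List files that the layout and manifest expect but that are not on disk."""
    manifest = json.loads((root / "manifest.json").read_text())
    expected = [f"{name}.parquet" for name in INDEX_TABLES]
    for group in ("data", "results"):
        expected += [f"{group}/{attr}.parquet" for attr in manifest["attributes"].get(group, [])]
    return [rel for rel in expected if not (root / rel).is_file()]


with tempfile.TemporaryDirectory() as tmp:
    record = init_record(Path(tmp) / "DataRecordA", "0.1", {"data": ["p_set"], "results": ["p"]})
    todo = missing_files(record)
    print(todo)
```

Keeping the attribute catalog in the manifest means a reader can tell which `<attr>.parquet` files should exist without listing the directories, which may matter if records end up on object storage.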

To be discussed / unclear:

  • Should a data record be immutable, and if so, from which point on? For example, by snapshotting it before or after a job in a workflow.
  • Should a single data record store the data of multiple results? Or would you rather create one data record per result and combine them for shared analytics?
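
One possible way to make the snapshotting question concrete is to fingerprint a record directory before and after a job and compare the two. This is only a sketch under assumptions: `record_fingerprint` is a hypothetical helper, hashing whole files is one option among several, and the file written by the simulated job contains placeholder bytes rather than real Parquet.

```python
import hashlib
import tempfile
from pathlib import Path


def record_fingerprint(root: Path) -> str:
    """Hash relative paths and file contents in sorted order."""
    digest = hashlib.sha256()
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        digest.update(path.relative_to(root).as_posix().encode())
        digest.update(b"\0")
        digest.update(path.read_bytes())
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    record = Path(tmp) / "DataRecordA"
    (record / "results").mkdir(parents=True)
    (record / "manifest.json").write_text('{"version": "0.1"}')

    before = record_fingerprint(record)
    unchanged = record_fingerprint(record) == before

    # Simulate a workflow job writing a result file into the record.
    (record / "results" / "p.parquet").write_bytes(b"placeholder bytes, not real Parquet")
    after = record_fingerprint(record)
```

If `before` and `after` differ, the job mutated the record, which is the situation the first question above asks about.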

Todos:

  • Implement first version
  • Draft collections of multiple data records, similar to the statistics module in PyPSA, which works on a single Network as well as on a collection of Networks.
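
To illustrate the idea that one query definition should serve a single record and a collection alike, here is a minimal sketch in plain Python. Rows mimic the long format of the attribute tables; the record names, the sample values, and the function `energy_balance` are made up for illustration, and the grouping by `ComponentType` is a simplification of what a real energy balance would do.

```python
from collections import defaultdict

# Rows follow the long format of the attribute tables:
# ComponentType | component | snapshot | scenario | period | value
Row = tuple[str, str, str, str, int, float]

record_a: list[Row] = [
    ("Generator", "wind", "t0", "s0", 2030, 5.0),
    ("Generator", "wind", "t1", "s0", 2030, 3.0),
    ("Load", "city", "t0", "s0", 2030, -4.0),
    ("Load", "city", "t1", "s0", 2030, -4.0),
]
record_b: list[Row] = [
    ("Generator", "wind", "t0", "s0", 2030, 6.0),
    ("Load", "city", "t0", "s0", 2030, -5.0),
]


def energy_balance(records: dict[str, list[Row]]) -> dict[tuple[str, str], float]:
    """Sum `value` per (record, ComponentType); works for one record or many."""
    totals: dict[tuple[str, str], float] = defaultdict(float)
    for name, rows in records.items():
        for component_type, _component, _snapshot, _scenario, _period, value in rows:
            totals[(name, component_type)] += value
    return dict(totals)


single = energy_balance({"A": record_a})
combined = energy_balance({"A": record_a, "B": record_b})
```

The record name simply becomes one more dimension of the result, so the query itself does not need to know how many records it was pointed at.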
