Analysis
Analysis of Keramik results depends on the purpose of the simulation. You may simply want to see average latencies, or you may want to dive deeper into the reported metrics. For profiling, you will want to use Datadog.
Quick Log Analysis
The simulation manager provides a very quick way to analyze the logs of a simulation run, but you will need to know the name of the manager pod. First check whether the simulate-manager pod has completed by running:
kubectl get pods
If the pod has completed and is no longer in that list, you can see recently terminated pods using:
kubectl get event -o custom-columns=NAME:.metadata.name | cut -d "." -f1
Once you have the name of the manager pod, you can retrieve its logs:
kubectl logs simulate-manager-<id>
If the simulate-manager pod is not in your pod list, you may need to get its logs with the --previous flag:
kubectl logs --previous simulate-manager-<id>
Analysis with DuckDB or Jupyter
First you will need to install a few things:
pip install duckdb duckdb-engine pandas jupyter jupysql matplotlib
To analyze the results of a simulation, copy the metrics-TIMESTAMP.parquet file from the opentelemetry-0 pod. Restart the pod first so that it writes out the parquet file footer:
kubectl delete pod opentelemetry-0
kubectl wait --for=condition=Ready pod/opentelemetry-0 # make sure pod has restarted
kubectl exec opentelemetry-0 -- ls -la /data # List files in the directory to find the TIMESTAMP you need
kubectl cp opentelemetry-0:data/metrics-TIMESTAMP.parquet ./analyze/metrics.parquet
cd analyze
Use duckdb to examine the data:
duckdb
> SELECT * FROM 'metrics.parquet' LIMIT 10;
Alternatively, start a Jupyter notebook using analyze/sim.ipynb:
jupyter notebook
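The same parquet file can also be queried from Python, which is what the notebook does. The sketch below is a minimal example using DuckDB's Python API; the column name `name` used to filter metrics is an assumption about the exported OTel schema, so inspect the schema first and adjust the queries to match what you see.

```python
# Minimal sketch of queries you might run against the exported metrics.
# The column "name" is an assumption about the OTel parquet schema; run the
# DESCRIBE query first and adjust the other queries to the columns you see.
import duckdb

# Inspect the schema of the parquet file.
print(duckdb.sql("DESCRIBE SELECT * FROM 'metrics.parquet'").df())

# List the distinct metric names recorded during the run (assumes a "name" column).
print(duckdb.sql("SELECT DISTINCT name FROM 'metrics.parquet' ORDER BY name").df())

# Pull a subset of rows into a pandas DataFrame for plotting with matplotlib.
df = duckdb.sql("SELECT * FROM 'metrics.parquet' LIMIT 1000").df()
print(df.head())
```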
Comparing Simulation Runs
How do we conclude that one simulation run is better or worse than another?
Each simulation will likely be targeting a specific result; however, there are common results we should expect to see.
Changes should not make correctness worse. Correctness is defined using two metrics:
- Percentage of events successfully persisted on the node that accepted the initial write.
- Percentage of events successfully replicated on nodes that observed the writes via the Ceramic protocol.
Changes should not make performance worse. Performance is defined using these metrics:
- Writes/sec across all nodes in the cluster and by node
- p50, p90, p95, p99, and p99.9 of the duration of writes across all nodes in the cluster and by node
- Success/failure ratio of write requests across all nodes in the cluster and by node
- p50, p90, p95, p99, and p99.9 of the time to become replicated, i.e. the time from when one node accepts a write to when another node has the same write available for read
These metrics apply to any simulation of the Ceramic protocol. Any report on the results of a simulation should include them and compare them against an established baseline; a sketch of how they might be computed from the exported data is shown below.
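As a rough illustration, the sketch below computes write-duration percentiles and a success/failure ratio from the exported parquet file with DuckDB. The metric names (write_duration_ms, write_requests) and columns (name, node, value, status) are hypothetical placeholders rather than the actual schema; substitute whatever names appear in your metrics.parquet.

```python
# Hedged sketch: compute comparison metrics from the exported parquet file.
# Metric names ("write_duration_ms", "write_requests") and columns (name, node,
# value, status) are hypothetical placeholders; replace them with the names
# that actually appear in your metrics.parquet.
import duckdb

con = duckdb.connect()

# Write-duration percentiles across the cluster, broken down by node.
percentiles = con.sql("""
    SELECT
        node,
        quantile_cont(value, 0.50)  AS p50,
        quantile_cont(value, 0.90)  AS p90,
        quantile_cont(value, 0.95)  AS p95,
        quantile_cont(value, 0.99)  AS p99,
        quantile_cont(value, 0.999) AS p999
    FROM 'metrics.parquet'
    WHERE name = 'write_duration_ms'
    GROUP BY node
    ORDER BY node
""").df()
print(percentiles)

# Success/failure counts of write requests across all nodes.
ratio = con.sql("""
    SELECT
        count(*) FILTER (WHERE status = 'ok')  AS successes,
        count(*) FILTER (WHERE status <> 'ok') AS failures
    FROM 'metrics.parquet'
    WHERE name = 'write_requests'
""").df()
print(ratio)
```

Running the same computation over a baseline run and a candidate run gives the numbers to compare when deciding whether a change made things better or worse.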
Performance Analysis
In addition to the metrics above, we can use Datadog to dive deeper into performance.