On-Call/Incident Response
As a scaled Ethereum staking provider, you're responsible for a significant part of the network's overall health and security. This guide provides you with targeted information on what to prioritize when incidents happen, ensuring that you can react effectively.
Monitor performance and error metrics such as missed attestations, node latency, and validator performance to identify issues early. Implement alerts for any anomalies in these metrics.
Set predefined thresholds for raising an alarm. For example, if more than 5% of validators are underperforming or if you observe an unusual surge in network requests, it should immediately trigger an alarm.
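A minimal sketch of such a threshold check, assuming you already collect per-validator attestation counts. The 5% fleet threshold comes from the example above; the `ValidatorStats` type, the function names, and the 10% per-validator miss ratio are hypothetical and should be tuned to your own setup.

```python
from dataclasses import dataclass

# Fleet-level threshold from the example above: alarm when more than 5%
# of validators are underperforming.
UNDERPERFORMING_ALARM_RATIO = 0.05


@dataclass
class ValidatorStats:
    """Hypothetical per-validator counters for one observation window."""
    index: int
    missed_attestations: int
    expected_attestations: int


def is_underperforming(v: ValidatorStats, max_miss_ratio: float = 0.10) -> bool:
    """Assumed rule: a validator underperforms if it misses more than
    max_miss_ratio of its expected attestations."""
    if v.expected_attestations == 0:
        return False
    return v.missed_attestations / v.expected_attestations > max_miss_ratio


def should_alarm(validators: list[ValidatorStats]) -> bool:
    """Return True when the share of underperforming validators exceeds
    the fleet-level threshold."""
    if not validators:
        return False
    bad = sum(1 for v in validators if is_underperforming(v))
    return bad / len(validators) > UNDERPERFORMING_ALARM_RATIO
```

In practice a rule like this would usually live in your alerting system rather than in a script; the point is that the threshold is explicit and agreed on before an incident.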
Initial Assessment: Determine the scope of the problem. Is it affecting one validator, multiple validators, or is it a network-wide issue?
Isolate the Issue: Segregate the affected validators to prevent the issue from spreading.
Consult Logs: Review system logs for any error messages or anomalies that could point to the root cause.
Communication: Notify your internal team. Transparency and quick communication are vital, especially if the issue impacts more than your operations.
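As a rough illustration of the log review step, the sketch below counts log lines by severity so you can see at a glance whether errors are isolated or widespread. The severity markers in the pattern are assumptions; log formats differ between clients, so adjust the pattern to whatever your consensus and validator clients actually emit.

```python
import re
from collections import Counter

# Assumed severity markers; adapt to your client's log format.
SEVERITY_PATTERN = re.compile(
    r"\b(CRIT|ERRO|ERROR|WARN|WARNING)\b", re.IGNORECASE
)

# Map spelling variants onto one label per severity.
NORMALIZED = {
    "CRIT": "CRIT",
    "ERRO": "ERROR",
    "ERROR": "ERROR",
    "WARN": "WARN",
    "WARNING": "WARN",
}


def summarize_log_lines(lines):
    """Count lines per severity, using the first marker found on each line."""
    counts = Counter()
    for line in lines:
        match = SEVERITY_PATTERN.search(line)
        if match:
            counts[NORMALIZED[match.group(1).upper()]] += 1
    return counts
```

This is a keyword match, not a parser: an informational line that happens to contain the word "error" is counted too, so treat the numbers as a starting point for reading the logs, not a replacement for it.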
Message Channels and Forums: Although the details are sensitive, sharing a suspected attack on public channels like Discord or Reddit can help you corroborate it with other operators.
Social Media: Use X or other platforms to alert the community; however, be very cautious and responsible with the language you use to prevent unnecessary panic.
Network Peers: If you're part of any coalitions or partnerships with other node operators, inform them so that they can also take precautionary measures.
Security Team: Alert your internal security team first for an initial assessment.
Ethereum Foundation Security: They have a responsible disclosure process for vulnerabilities.
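GitHub: If the vulnerability is in an open-source tool, you may also report it privately through the respective GitHub repository, for example via its private vulnerability reporting feature if the maintainers have enabled it. Do not post the details in a public issue.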
Private Communication Channels: For less immediate vulnerabilities, reach out to trusted peers in the industry via secure, private channels to verify the issue before going public.
What to look for first?
Is the node up and running? Is the validator client up and running? CPU/RAM/Disk space okay?
Read the logs. Are there enough peers? Is the number of validators found by the validator client as you expected?
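The first checks above can be sketched as a small script. It assumes your consensus client exposes the standard Beacon API, where `GET /eth/v1/node/peer_count` returns the peer counts as strings under `data`. The thresholds (10% free disk, 20 peers) and the function names are hypothetical, and how you obtain the number of validators your validator client has loaded depends on the client.

```python
import shutil


def free_disk_ratio(path: str) -> float:
    """Fraction of the filesystem at `path` that is still free."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0
    return usage.free / usage.total


def connected_peers(peer_count_response: dict) -> int:
    """Extract the connected peer count from a parsed
    /eth/v1/node/peer_count response (the API returns numbers as strings)."""
    return int(peer_count_response["data"]["connected"])


def basic_health(
    data_dir: str,
    peer_count_response: dict,
    validators_found: int,
    validators_expected: int,
    min_free_ratio: float = 0.10,  # assumed threshold
    min_peers: int = 20,           # assumed threshold
) -> list[str]:
    """Return a list of human-readable problems; empty means all checks passed."""
    problems = []
    if free_disk_ratio(data_dir) < min_free_ratio:
        problems.append("low disk space")
    peers = connected_peers(peer_count_response)
    if peers < min_peers:
        problems.append(f"only {peers} peers connected")
    if validators_found != validators_expected:
        problems.append(
            f"validator client found {validators_found} validators, "
            f"expected {validators_expected}"
        )
    return problems
```

Returning a list of problems rather than a single boolean keeps the output useful for whoever is on call: they see which check failed, not just that something did.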
Being a scaled node operator comes with the responsibility of ensuring the network's security and efficiency. Adequate preparation and knowing precisely what to focus on when issues arise will make your incident response effective and timely. Always remember, in times of incidents, swift action and clear communication are key.
Is your node in sync, or is it still syncing? If it is in sync, is it on the right fork? Take your node's latest block and check it against any public block explorer or with the community.
Is the network finalizing? The finalized epoch should advance roughly every 6.4 minutes (one epoch is 32 slots of 12 seconds each).
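The sync and finality questions can be checked against the standard Beacon API. The sketch below assumes the endpoints `GET /eth/v1/node/syncing` and `GET /eth/v1/beacon/states/head/finality_checkpoints`; the base URL, the function names, and the tolerance values are assumptions to adapt to your setup. One epoch is 32 slots of 12 seconds, so 384 seconds or 6.4 minutes.

```python
import json
import urllib.request

SLOTS_PER_EPOCH = 32
SECONDS_PER_SLOT = 12


def epoch_duration_minutes() -> float:
    """Length of one epoch on Ethereum mainnet, in minutes."""
    return SLOTS_PER_EPOCH * SECONDS_PER_SLOT / 60


def fetch_json(base_url: str, path: str) -> dict:
    """Fetch and parse a Beacon API response. base_url is an assumption,
    e.g. the address of your own consensus client's HTTP API."""
    with urllib.request.urlopen(base_url + path, timeout=5) as resp:
        return json.load(resp)


def is_synced(syncing_response: dict, max_distance: int = 2) -> bool:
    """True if the node reports it is not syncing and is within
    max_distance slots of the head (assumed tolerance)."""
    data = syncing_response["data"]
    return (not data["is_syncing"]) and int(data["sync_distance"]) <= max_distance


def finality_lag_epochs(head_slot: int, checkpoints_response: dict) -> int:
    """Epochs between the head and the last finalized checkpoint."""
    finalized_epoch = int(checkpoints_response["data"]["finalized"]["epoch"])
    return head_slot // SLOTS_PER_EPOCH - finalized_epoch
```

On a healthy network the finalized checkpoint normally trails the head by about two epochs. A lag that keeps growing suggests the network is not finalizing, while a healthy lag combined with a large `sync_distance` points to a problem with your own node.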
The information in the Scaled Node Operators section has been written and reviewed by and , a leading large scale Ethereum staking infrastructure provider.