
Coordination Service in Error State on One Node


Published: 21 Feb 2020
Last Modified Date: 22 Nov 2021

Issue

After pending changes are applied, Tableau Server is in a degraded state and the primary node shows the Coordination Service in an error state.

The following may appear in the appzookeeper logs:
 
Thread-2 : ERROR org.apache.zookeeper.server.quorum.QuorumPeer - Unable to load database on disk
java.io.IOException: The accepted epoch, c is less than the current epoch, d
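
To check whether a node's logs contain this error, a search along the following lines may help. This is a minimal sketch: the function name is hypothetical, and the default log directory is an assumption based on the default Linux data path, so pass your own directory if it differs.

```shell
# Hypothetical helper: list log files that contain the epoch error.
# ASSUMPTION: the default directory below matches a default Linux install;
# pass a different directory as the first argument if yours differs.
find_epoch_errors() {
  log_dir="${1:-/var/opt/tableau/tableau_server/data/tabsvc/logs/appzookeeper}"
  # -r: search recursively, -l: print only file names, -E: extended regex
  grep -rlE 'accepted epoch, .* is less than the current epoch' "$log_dir" 2>/dev/null
}
```

Calling `find_epoch_errors` with no argument searches the assumed default directory; a non-zero exit status means no file matched or the directory could not be read.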

Environment

  • Tableau Server
  • HA environment (3 nodes) with a coordination service ensemble deployed

Resolution

Option 1: Redeploy the 3-node Coordination Service ensemble by performing the following actions one at a time:
  1. Deploy a new (temporary) Coordination Service ensemble that only includes the primary node (node 1).
  2. Clean up the old configuration.
  3. Re-deploy the Coordination Service Ensemble of 3 nodes for High Availability.
  4. Clean up the temporary configuration of a coordination node.
The documentation for deploying an ensemble can be found here:

Deploy a new Coordination Service ensemble
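
As a rough illustration, the four steps of Option 1 might map onto tsm commands as sketched below. The function only prints the commands (a dry run) so the sequence can be reviewed first. Assumptions: the node IDs are node1, node2 and node3; the explicit `tsm topology cleanup-coordination-service` step applies to versions where cleanup is manual; and `print_redeploy_plan` is a hypothetical name. The linked documentation remains the authoritative procedure, including the waiting periods between steps and restarting the server afterwards.

```shell
# Dry-run sketch: prints the tsm commands for Option 1 without running them.
# Arguments: primary node ID, then the comma-separated list of all ensemble nodes.
print_redeploy_plan() {
  primary="$1"
  all_nodes="$2"
  # Ensemble changes are made while Tableau Server is stopped
  echo "tsm stop"
  # Step 1: temporary single-node ensemble on the primary
  echo "tsm topology deploy-coordination-service -n ${primary}"
  # Step 2: clean up the old ensemble (ASSUMPTION: manual on this version)
  echo "tsm topology cleanup-coordination-service"
  # Step 3: redeploy the full ensemble for High Availability
  echo "tsm topology deploy-coordination-service -n ${all_nodes}"
  # Step 4: clean up the temporary single-node ensemble
  echo "tsm topology cleanup-coordination-service"
}
```

For example, `print_redeploy_plan node1 node1,node2,node3` prints the five commands in order. Run each one manually, and only after the previous one has fully completed.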

Option 2: Recreate the ZooKeeper snapshot on node1.

0. Take a backup of Tableau Server with 'tsm maintenance backup'.
1. Stop Tableau Server
    tsm stop
2. Run 'tsm status -v'.
    Confirm that the Coordination Service is running on node2 and node3, and that it is in an error or unavailable state on node1. If that is not the case, do not continue with the following steps.

3. Stop the TSM administrative services.
    sudo /opt/tableau/tableau_server/packages/scripts.<build number>/stop-administrative-services
4. Confirm that no Tableau-related services are running, including appzookeeper.
    ps -ef | grep appzookeeper
5. Rename the ZooKeeper version-2 directory on node1 as follows:
    cd /var/opt/tableau/tableau_server/data/tabsvc/appzookeeper/1
    sudo mv ./version-2 ./version-2.bk
6. Start the TSM administrative services.
    sudo /opt/tableau/tableau_server/packages/scripts.<build number>/start-administrative-services
7. Run 'tsm status -v' to check whether the Coordination Service on node1 is running.
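
A few of the steps above can be made less error-prone with small helpers. The sketch below is illustrative only: all three function names are hypothetical, and the default packages path is taken from the commands in this article. The helpers resolve the scripts.<build number> directory without typing the build number, filter 'tsm status -v' output down to the Coordination Service lines, and move version-2 aside without overwriting an earlier backup.

```shell
# Hypothetical helpers for Option 2; names and defaults are illustrative.

# Print the scripts.<build number> directory, but only if exactly one exists.
find_scripts_dir() {
  packages="${1:-/opt/tableau/tableau_server/packages}"
  count=0
  found=""
  for d in "$packages"/scripts.*/; do
    [ -d "$d" ] || continue          # skips the unexpanded pattern when nothing matches
    count=$((count + 1))
    found="${d%/}"                   # drop the trailing slash
  done
  if [ "$count" -eq 1 ]; then
    printf '%s\n' "$found"
  else
    echo "expected exactly one scripts.* directory in $packages, found $count" >&2
    return 1
  fi
}

# Keep only the Coordination Service lines from status output read on stdin.
coordination_status_lines() {
  grep -i 'coordination service'
}

# Move version-2 aside, refusing to overwrite an existing version-2.bk.
# On a real node this needs root privileges (the article uses sudo mv).
move_version_dir_aside() {
  zk_dir="$1"
  [ -d "$zk_dir/version-2" ] || { echo "no version-2 directory in $zk_dir" >&2; return 1; }
  [ ! -e "$zk_dir/version-2.bk" ] || { echo "version-2.bk already exists in $zk_dir" >&2; return 1; }
  mv "$zk_dir/version-2" "$zk_dir/version-2.bk"
}
```

Typical usage would be `tsm status -v | coordination_status_lines` for steps 2 and 7, and `"$(find_scripts_dir)/stop-administrative-services"` (with sudo) for step 3. For step 4, note that `ps -ef | grep appzookeeper` always lists the grep process itself; `ps -ef | grep '[a]ppzookeeper'` avoids that false match.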

Cause

A possible root cause of the error is corruption of configuration files. This can happen if the primary node has disk-related errors such as space exhaustion.
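
Since space exhaustion is named as a possible trigger, a quick check of disk usage on the primary node may be worthwhile. The following is a minimal sketch; the function name is hypothetical and the default path is assumed from the data directory used elsewhere in this article.

```shell
# Hypothetical helper: print the used-space percentage of the filesystem
# that holds the given path (digits only, without the % sign).
disk_used_percent() {
  target="${1:-/var/opt/tableau/tableau_server}"
  # df -P prints a header plus one POSIX-format line; column 5 is the capacity used
  df -P "$target" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}
```

A value close to 100 suggests freeing disk space before attempting either resolution option.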

Additional Information

Please read the documentation, as it covers several critical points in the process, including but not limited to:
  1. Ensure that no changes are pending before making any change to the coordination ensemble.
  2. After deploying the coordination service, be sure to wait until:
    • Tableau Server has a status of STOPPED for each node.
    • The Admin Agent and Controller services are running as expected on each node.
  3. Ensure that the previous coordination service configuration is cleaned up while Tableau Server is stopped.

For more information, see: Deploy a new coordination ensemble
