Solved: C.12 Sync Manager Watchdog

AutomateSHANE · 04-03-2023

My SAFEX C.12 is configured and running an application. However, on the EtherCAT bus it repeatedly drops to Safe-OP with the diagnostic '0x001B: Sync manager watchdog'. If I command the bus to OP again, it reaches OP but then falls back to Safe-OP after a random amount of time. All other slaves on the bus are OK. This is a small application for STO outputs with some hardwired inputs; there is no FSoE in this project. I cannot find either a cause or a remedy for this situation.

Filip_K · 04-11-2023

Hello,

We have the same problem. We have a SafeX C.15 and get the same error.

The machine also contains some IndraDrives and an S20 bus coupler with some I/Os.

The safety program is also not very complicated.

In our case the error also appears randomly. The machine is now at the customer's factory and will go into production very soon, so we must get rid of this problem. Restarting the EtherCAT communication clears the error.

What could cause this problem and what could be the solution?

Best regards, Filip

AllAutomation · 04-11-2023

@CodeShepherd, what is the most probable root cause behind the error message '0x001B: Sync manager watchdog'? How can we narrow down the problem?

AllAutomation · 04-11-2023

Hello @Filip_K, can you please give some more information about:

- ctrlX CORE version and settings (EC cycle, etc.)

- ctrlX SAFETY version and settings (FSoE-Master and -Slaves)

Thanks

ctrlX SAFETY Team

AllAutomation · 04-11-2023

@Filip_K, @AutomateSHANE,

the error is described in the ctrlX CORE AL status list as:

state change not executed or
internal error detected
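For reference, an EtherCAT slave reports its state and the reason for an error in two registers defined in ETG.1000: the AL Status register (0x0130), whose low nibble is the state and whose bit 4 is the error flag, and the AL Status Code register (0x0134), where 0x001B means "Sync manager watchdog". The Python sketch below decodes both. The code table is only a partial excerpt, and the example register value 0x14 (Safe-OP with the error flag set) is assumed, not read from a SAFEX.

```python
# Minimal decoder for an EtherCAT slave's AL Status register (0x0130) and
# AL Status Code register (0x0134). Only a few codes are listed here.

AL_STATES = {1: "INIT", 2: "PRE-OP", 3: "BOOTSTRAP", 4: "SAFE-OP", 8: "OP"}

AL_STATUS_CODES = {
    0x0000: "No error",
    0x0001: "Unspecified error",
    0x0011: "Invalid requested state change",
    0x0012: "Unknown requested state",
    0x001A: "Synchronization error",
    0x001B: "Sync manager watchdog",
    0x001D: "Invalid output configuration",
    0x001E: "Invalid input configuration",
}


def decode_al_status(al_status: int, al_status_code: int) -> str:
    """Return a readable summary of the AL Status and AL Status Code registers."""
    state_bits = al_status & 0x0F
    state = AL_STATES.get(state_bits, f"unknown (0x{state_bits:X})")
    if not al_status & 0x10:  # bit 4: error indication
        return f"{state}, no error"
    text = AL_STATUS_CODES.get(al_status_code, "code not in this table")
    return f"{state}, error 0x{al_status_code:04X}: {text}"


# A slave that dropped out of OP with the diagnostic from this thread:
print(decode_al_status(0x14, 0x001B))  # SAFE-OP, error 0x001B: Sync manager watchdog
```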

Please provide the Log Book entries of the SAFEX-C.12. What happened internally when 0x001B was indicated?

AutomateSHANE · 04-11-2023

@AllAutomation

The Log Book of the SAFEX is not very helpful. The entries are time-stamped with the operating time in seconds, which must be converted manually to days, hours, minutes, and seconds. The entries are also not sorted chronologically, which I assume is because the log is a ring buffer. Anyway, there are no entries that correspond to the times at which the 0x001B diagnostic appears.

The only two alarms that appear concern two outputs, and I have verified that nothing is wrong with them. Aside from that, the most frequent entry is Info Number 10260, "ctrlX SAFETYlink: EC_ZB_TIMEOUT_INIT". This does not make sense either, because I am not using the SAFETYlink connection.

As you can see, the latest entry is at 4d 2h, yet the operating time today already exceeds 4d 6h and there are no new entries.
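The conversion and re-sorting described above can be scripted instead of done by hand. This minimal Python sketch assumes the entries have already been copied out as (seconds, message) pairs; that format and the example entries are hypothetical, not an export format of ctrlX SAFETY Engineering.

```python
# Sketch: turn Log Book operating-time stamps (seconds) into "Xd Yh Zm Ws"
# and sort the entries chronologically. Input format is an assumption.

def format_operating_time(seconds: int) -> str:
    days, rest = divmod(seconds, 86_400)
    hours, rest = divmod(rest, 3_600)
    minutes, secs = divmod(rest, 60)
    return f"{days}d {hours}h {minutes}m {secs}s"


def sorted_log(entries: list[tuple[int, str]]) -> list[tuple[str, str]]:
    """Return (formatted time, message) pairs, oldest entry first."""
    return [(format_operating_time(t), msg) for t, msg in sorted(entries)]


# Hypothetical entries, unordered as a ring buffer might return them:
entries = [
    (353_000, "Info 10260: ctrlX SAFETYlink: EC_ZB_TIMEOUT_INIT"),
    (12_345, "Output alarm"),
    (352_800, "Info 10260: ctrlX SAFETYlink: EC_ZB_TIMEOUT_INIT"),
]
for when, msg in sorted_log(entries):
    print(when, msg)
```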

Filip_K · 04-12-2023

We have a ctrlX CORE X3 with firmware version 1.18.4. The motion task and the safety task are both set to 5 ms in PLC Engineering. The EtherCAT cycle time is 2000 µs in IO Engineering.
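To relate the 2000 µs cycle to the watchdog, it helps to know how a standard EtherCAT slave controller (ESC) times its sync manager watchdog: one increment is (divider + 2) ticks of the 25 MHz clock, and the timeout is that increment multiplied by the watchdog time register. The sketch below uses the ESC power-on defaults (divider 0x09C2 in register 0x0400, process-data watchdog 0x03E8 in register 0x0420), giving 100 µs × 1000 = 100 ms. Whether the SAFEX uses these defaults is an assumption; the thread does not state its configured values.

```python
# Sketch: sync manager (process data) watchdog timeout of a standard ESC,
# expressed in bus cycles. Register values are ESC power-on defaults; the
# SAFEX's actual settings are an assumption here.

ESC_WD_CLOCK_HZ = 25_000_000   # watchdog base clock: 25 MHz, i.e. 40 ns per tick
WD_DIVIDER = 0x09C2            # register 0x0400, default 2498
WD_TIME_PDATA = 0x03E8         # register 0x0420, default 1000 increments


def sm_watchdog_timeout_s(divider: int, wd_time: int) -> float:
    """Timeout = (divider + 2) ticks per increment, times the number of increments."""
    increment_s = (divider + 2) / ESC_WD_CLOCK_HZ   # 2500 ticks * 40 ns = 100 µs
    return increment_s * wd_time


timeout_s = sm_watchdog_timeout_s(WD_DIVIDER, WD_TIME_PDATA)
cycle_s = 2000e-6  # EtherCAT cycle time from the post above
print(f"timeout {timeout_s * 1e3:.1f} ms = {round(timeout_s / cycle_s)} bus cycles")
# prints: timeout 100.0 ms = 50 bus cycles
```

Under these assumed defaults, process data would have to stop reaching the slave for roughly 50 consecutive 2 ms cycles before the watchdog expires, so a single lost frame alone would not be expected to trip it.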

The SafeX is a C.15 controller. I don't know which firmware version it has, but maybe you can read it out from the system. We didn't update the firmware.

We were using ctrlX Safety Engineering 1.7.1.8239. There are 2 HCS drives with SafeMotion connected via FSoE and one drive without SafeMotion.

Here are some more settings from SafeX.

I can't provide you logs from SafeX at the moment.

AllAutomation · 04-12-2023

Thank you @Filip_K,

Hello @Filip_K and @AutomateSHANE,

I received feedback from development and was asked to check with you: what is your Engineering access method?

Serial-over-USB
EoE (Eth1)
TCP/IP (Eth2)

Best regards

Your ctrlX SAFETY team

Filip_K · 04-12-2023

We were using USB to connect to the SafeX.

AutomateSHANE · 04-12-2023

@AllAutomation I have used both USB and EoE. It makes no difference with regard to this error.

AllAutomation · 04-12-2023

Hello @AutomateSHANE and @Filip_K,

So far we have not been able to reproduce your case. Does it make a difference whether you have a running Engineering connection or not?

If the Engineering connection is active, what is happening at that time: debugging, a configuration download, ...? During a configuration download the safe PLC goes into Stop and is disconnected from EtherCAT communication. Does this ring a bell?

Could a video shed more light on what is going on? It might show something we have not taken into account.

Best regards

Your ctrlX SAFETY team

AutomateSHANE · 04-12-2023

@AllAutomation I had wondered whether online edits or downloads via PLC Engineering were interfering with the bus communication. I am commissioning and debugging a machine, so this happens often. However, I stopped for a while and did nothing but monitor the bus status in IO Engineering. After 5 to 10 minutes, the error occurred again. Yes, my PC had an active connection to the CORE with PLC Engineering and IO Engineering online, but nothing else was happening.

Filip_K · 04-13-2023

In our case, the Engineering connection between the PC and the SafeX was not active when the errors occurred during the machine process. We changed something in the SafeX and then unplugged the USB cable, so there was no communication there.

When the error occurred, the machine stopped and the SafeX LED turned steady orange.

We don't have a video, but it would not show anything besides the machine stopping abruptly because the EtherCAT communication goes from OP to Pre-OP.

Today my colleague will call the person we were working with and ask whether the problem is still present, because we have had no news for several days while we were away. I will keep you updated when I get more information.

Maax · 04-13-2023

Hello all,

I have (had) the same problem with the SafeX C15 as an EtherCAT slave on a CORE X3 Plus. The error occurred every 5 to 10 minutes.

In my setup I don't use the "Fast Channel" of the C15. It seems that the "empty" Fast Channel is causing the problem.

When I go to "Network Settings" on the C15 and change the "Fast Channel" from "SafetyLink" (the default setting) to "none", I no longer get the EtherCAT error (for 2 days now).

I don't know if this was the actual problem, but for now I'm happy with the solution.

If you don't use the Fast Channel in your setup, you can try whether this solves the problem in your application.

Regards

Max

Filip_K · 04-13-2023

Hello @Maax,

that's interesting. We are not using SafetyLink either, so maybe changing it to "none" would help us too.

@AutomateSHANE, could you please give us some hints on connecting to the SafeX using EoE? We haven't used it so far, and I would be very grateful for any tips from you or anybody else 🙂

AutomateSHANE · 04-13-2023

@Maax It is very interesting that this solved the problem for you. I am not using the Fast Channel either, but my setting is already 'None' and I still have this problem.

AutomateSHANE · 04-13-2023

@Filip_K, connecting via EoE is not very difficult; it just requires several steps. In IO Engineering, double-click the EtherCAT master, then select the EoE tab. There you must enable the Virtual Ethernet port and assign an IP address. You can use the same settings I have here:

Notice that, on the right, the master automatically assigns addresses in the range .1 to .253 to all slaves that have EoE enabled.
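The automatic assignment can be pictured as handing out host addresses .1 to .253 of the EoE subnet, in order, to the slaves that have EoE enabled. This Python sketch is purely illustrative: the 172.31.254.0/24 subnet, the slave names, and the in-order rule are assumptions, not the ctrlX CORE's documented algorithm.

```python
# Illustrative only: hand out EoE addresses .1 to .253 of a /24 subnet, in
# order, to the slaves that have EoE enabled. Subnet, slave names and the
# ordering rule are assumptions, not the ctrlX CORE's documented behavior.
import ipaddress


def assign_eoe_addresses(subnet: str, slaves: list[tuple[str, bool]]) -> dict[str, str]:
    """Map each EoE-enabled slave name to the next host address (.1 ... .253)."""
    base = ipaddress.ip_network(subnet).network_address
    pool = [base + i for i in range(1, 254)]            # .1 .. .253
    enabled = [name for name, eoe_on in slaves if eoe_on]
    if len(enabled) > len(pool):
        raise ValueError("more EoE-enabled slaves than addresses in .1-.253")
    return {name: str(addr) for name, addr in zip(enabled, pool)}


slaves = [("SAFEX", True), ("EFC drive", True), ("S20 coupler", False)]
print(assign_eoe_addresses("172.31.254.0/24", slaves))
# {'SAFEX': '172.31.254.1', 'EFC drive': '172.31.254.2'}
```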

Below, it shows all enabled slaves and their information. For example, my SAFEX control is listed first:

For this to work, you must enable EoE for the SAFEX. Double-click your SAFEX control in IO Engineering, then select the 'Expert settings' check box. An EoE tab now appears for the SAFEX. On that tab, enable the Virtual Ethernet port for the SAFEX and make sure 'IP-Port' is selected. Here you will also see the IP address of the SAFEX.

In ctrlX SAFETY Engineering, navigate to 'Connection Settings' and select 'Ethernet'. There, enter the IP address that was displayed for the SAFEX.

Next, you need to enable IP forwarding on the CORE. Log in to your ctrlX CORE via the web interface and select Settings, then Network Interfaces. For both XF10 (or XF51, if you are connected that way) and 'eoe0', select 'Enable IP forwarding'.

The last step is to set up routing on your PC. The SAFEX's IP address exists only on the EtherCAT bus, so the Ethernet connection must be routed through the CORE. You must add an entry to your PC's routing table so that it knows this hop. In my case, my PC is connected to the XF10 port of my CORE, which is set to 192.168.1.1, so I set up a route through this address.

Open a Windows command prompt with administrator privileges and type: route -p add 172.31.0.0 MASK 255.255.0.0 192.168.1.1 (or whichever IP addresses match your configuration).

The '-p' option makes the route persistent, so it survives a reboot. If the address of your CORE may change in the future, or you only need the connection temporarily, omit '-p' from the command.
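When adapting the route command to your own addresses, note that Windows `route add` rejects a route whose destination has bits set outside the mask (for example destination 172.31.254.0 with mask 255.255.0.0), reporting that the mask parameter is invalid. This Python sketch uses the standard `ipaddress` module to derive a valid destination from any address/mask pair and assemble the command; the addresses are the example values from this post, and the helper names are hypothetical.

```python
# Sketch: build a Windows "route add" command whose destination matches the
# mask. Uses only Python's standard ipaddress module; addresses are examples.
import ipaddress


def route_destination(address: str, mask: str) -> str:
    """Clear the host bits so that destination & mask == destination."""
    net = ipaddress.ip_network(f"{address}/{mask}", strict=False)
    return str(net.network_address)


def route_command(address: str, mask: str, gateway: str, persistent: bool = True) -> str:
    dest = route_destination(address, mask)
    prefix = "route -p add" if persistent else "route add"
    return f"{prefix} {dest} MASK {mask} {gateway}"


print(route_command("172.31.254.0", "255.255.0.0", "192.168.1.1"))
# route -p add 172.31.0.0 MASK 255.255.0.0 192.168.1.1
print(route_command("172.31.254.0", "255.255.255.0", "192.168.1.1", persistent=False))
# route add 172.31.254.0 MASK 255.255.255.0 192.168.1.1
```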

Now you should be able to connect to SAFEX control from SAFETY Engineering using EoE!

Maax · 04-25-2023

Hi all,

for the past two days the "Sync Manager Watchdog" error has been occurring on EtherCAT again, although I haven't changed anything in the configuration.

Is there any news, or are there any ideas?

Regards

Max

AllAutomation · 04-26-2023

Hello @Maax,

We have been investigating this behavior since the very first report. Unfortunately, we have not found the source of the problem yet. Recently, however, a similar problem has occurred under other circumstances (e.g. during EoE transfers). The plan is now to analyze these cases and use them to find the root cause of your problem.

If you have any further information about what triggers this behavior, please let us know, here or via private message. Can you also get in touch with your local Bosch Rexroth technical support?

Best regards

Your ctrlX SAFETY team

Filip_K · 04-27-2023

Hello,

I don't have good news about our situation and the problem.

As far as I know, the watchdog error is still appearing. On one day it appeared about 4 times, and two or three of these errors occurred while someone was connected to the ctrlX CORE. At first we thought this might be causing the problem, but no: he disconnected from the controller and the error still appeared. He said the behavior after the error was quite odd, because the Cartesian robot did not stop immediately after the error; it finished its cycle and stopped before the next motion cycle.

Please tell us what we can provide (programs, files, etc.) that might help you find something.

AutomateSHANE · 04-27-2023

I'm still having the issue too. I have not been able to correlate it with anything; it seems completely random. My local support team is aware of the issue, but nobody understands why it happens. They suggested increasing the EtherCAT bus cycle time, but this had no immediately noticeable effect. At one point, without my changing anything in particular, the problem disappeared, and the error did not occur for several days. Then, yesterday, it started coming back very frequently.

AllAutomation · 04-28-2023

Hello @AutomateSHANE,

thanks to your help with the Wireshark recording, we now have an indication, but still no direct chain to the root cause. The recording shows frequent "EoE Transmission Errors" with "Missing EoE Fragment". These have no visible effect for a long time, but then the SAFEX-C.1x suddenly switches to Safe-OP.
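To illustrate what "Missing EoE Fragment" means: EoE splits an Ethernet frame into fragments that share a frame number, carry a fragment number starting at 0, and flag the last fragment. A gap in that sequence, or a new frame starting before the last fragment arrives, leaves the frame incomplete. The Python sketch below flags such frames in a list of already-decoded headers; it does not parse raw captures, the input format is an assumption, and it is not how Wireshark's dissector is implemented.

```python
# Sketch: flag incomplete EoE frames in a sequence of already-decoded
# fragment headers. The input format is assumed for illustration.
from typing import Iterable, NamedTuple


class EoeFragment(NamedTuple):
    frame_no: int      # frame number shared by all fragments of one frame
    fragment_no: int   # 0 for the first fragment, then +1 per fragment
    last: bool         # set on the final fragment of the frame


def find_incomplete_frames(fragments: Iterable[EoeFragment]) -> list[int]:
    """Return frame numbers whose fragment sequence had a gap or was cut off."""
    incomplete = []
    current = None   # frame number being reassembled, or None
    expected = 0     # next fragment number expected for `current`
    broken = None    # frame whose remaining fragments are ignored
    for frag in fragments:
        if frag.frame_no == broken:
            continue
        broken = None
        if current is not None and frag.frame_no != current:
            incomplete.append(current)   # frame ended without its last fragment
            current = None
        if current is None:
            current, expected = frag.frame_no, 0
        if frag.fragment_no != expected:
            incomplete.append(current)   # gap: at least one fragment missing
            broken, current = current, None
            continue
        expected += 1
        if frag.last:
            current = None               # frame complete
    if current is not None:
        incomplete.append(current)       # capture ended mid-frame
    return incomplete


frags = [EoeFragment(1, 0, False), EoeFragment(1, 1, True),
         EoeFragment(2, 0, False), EoeFragment(2, 2, True),   # fragment 1 lost
         EoeFragment(3, 0, True)]
print(find_incomplete_frames(frags))  # [2]
```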

At the moment we have not identified what produces these errors or how they lead to the switch to Safe-OP. If you do not need EoE, switching it off in the configuration might be a temporary workaround.

Which EtherCAT-Master are you using? The ctrlX CORE or a third-party controller?

Best regards

Your ctrlX SAFETY team

AutomateSHANE · 04-28-2023

Thanks. I am using a CORE X3 as the EtherCAT master. EoE was enabled for ease of commissioning (of the SAFEX and EFC drives) but is not being used at the moment. I will switch it off.

AllAutomation · 05-03-2023

Thanks @AutomateSHANE,

Hello @Filip_K and @Maax,

are you able to create a Wireshark recording while '0x001B: Sync manager watchdog' occurs?

My development colleague has prepared a recommended recording configuration, and at least @Filip_K has already been asked for the following in a private message:

Such a recording can help us a lot, since we have not yet established a clear connection between the EoE errors and the drop of the SAFEX-C.1x from OP to Safe-OP.

Please add all relevant information to your reply:

EtherCAT-Master and Version
SAFEX-C.1x FW Version (System Info)

Thank you for your support

Your ctrlX SAFETY team

AllAutomation · 05-10-2023

Hello @AutomateSHANE, @Filip_K, @Maax,

we do not yet understand the mechanism by which this error is triggered. We see several applications where it occurs and see some relation to EoE, but do not understand the actual mechanism.

We know that the error is generated by the device application (firmware) and is not an automatic reaction of the EtherCAT slave FPGA logic.

We also know that the device recovers immediately when OP is commanded and is then back in operation. Our colleague @Dias has proposed a method to recover programmatically from errors such as '0x001B: Sync manager watchdog', and he has been asked to share it in the community. If you want to use this workaround before it is publicly shared, please contact @Dias via private message.

Best regards

Your ctrlX SAFETY team

Dias · 05-11-2023

Hello,

as a workaround, I created a short program that can be imported:

You need the following library for this:

This program is just an example and I hope it is helpful for you.

Filip_K · 05-17-2023

Hello everyone,

Sorry for not replying for quite some time. The problem still occurs occasionally and unpredictably; there is no pattern to this SafeX issue.

We will try to integrate this function block into our program and give you feedback.

We will also try to make the Wireshark recording and send you the results, along with the firmware information you asked for in previous messages.

Best regards, Filip

Filip_K · 05-24-2023

Hello @Dias,

One quick question about your program.

Should the variable "uiEthercatAddr" be set to e.g. 1005 (the SafeX EtherCAT address), or should it always be 1001 as in your program? I see the comment above it, but I want to be sure.

Also, if this variable should be changed to the desired EtherCAT address, does it also have to be changed in the PLC code below, not only in the variable declarations?

I have just run this program on my test ctrlX CORE, but I cannot reproduce the SafeX error there, so I can't fully test it. I want to clarify this solution and be sure we are doing things correctly so that we don't waste time later.

Thanks for your answer,

Best regards, Filip

AutomateSHANE · 05-31-2023

Hi @Filip_K ,

I don't see that Dias has replied here, so I will answer in case you still need help. The variable you are asking about is initialized to 1001 because that is the address of the first device on the EtherCAT bus; the addresses of the other devices increment from there. The program continuously checks the EtherCAT master state. If the master is in error, specifically with a topology problem, the program reads the number of slaves configured on the bus. It then cycles through each slave and checks whether it is detected on the bus. If any slave is not in the Operational state, it commands the master to set the bus state to OP, forcing any slaves that may be stuck in Init or Safe-OP (the SAFEX, in this case) back to Operational.
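The decision logic described above can be sketched in Python as follows, so the control flow can be tried out on a PC. This is not Dias's PLC program and does not use the ctrlX library; the `SimulatedMaster` class and its methods are hypothetical stand-ins that only simulate slave states and record the OP request.

```python
# Illustrative simulation of the recovery logic described above. This is NOT
# the actual PLC program: SimulatedMaster and its methods are hypothetical
# stand-ins for the EtherCAT master, not a ctrlX library API.
from dataclasses import dataclass, field

FIRST_ADDRESS = 1001  # configured address of the first slave on the bus


@dataclass
class SimulatedMaster:
    topology_error: bool                 # master reports a topology problem
    slave_states: dict[int, str]         # address -> "INIT", "SAFE-OP", "OP", ...
    requests: list[str] = field(default_factory=list)

    def num_config_slaves(self) -> int:
        return len(self.slave_states)

    def request_bus_op(self) -> None:
        """Record an OP request and, in this simulation, assume it succeeds."""
        self.requests.append("OP")
        self.slave_states = {addr: "OP" for addr in self.slave_states}


def check_and_recover(master: SimulatedMaster) -> bool:
    """Return True if an OP request was sent for the whole bus."""
    if not master.topology_error:
        return False
    last = FIRST_ADDRESS + master.num_config_slaves()
    for addr in range(FIRST_ADDRESS, last):
        if master.slave_states.get(addr) != "OP":
            master.request_bus_op()
            return True
    return False


# Example: the SAFEX at 1002 has dropped to SAFE-OP.
master = SimulatedMaster(True, {1001: "OP", 1002: "SAFE-OP", 1003: "OP"})
print(check_and_recover(master), master.slave_states[1002])  # True OP
```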

The section of the code you highlighted resets this loop to the starting address 1001 as soon as it has gone past the address of the last slave on the bus. For example, if you have 5 slaves on the bus, the addresses are 1001, 1002, 1003, 1004, and 1005, and the value of fbIL_ECATGetNumConfigSlaves.NumConfigSlaves is 5. So as soon as the loop index (uiEthercatAddr) reaches 1006 (1001 + 5), it is reset to 1001, before a command is executed for a slave address that doesn't exist.
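The wrap-around in the highlighted section amounts to the following arithmetic. This Python sketch assumes the scan index advances by one slave per PLC cycle; the function name is hypothetical and the 5-slave example matches the post.

```python
# Sketch of the scan-index wrap-around described above (one slave per cycle
# is an assumption). Addresses start at 1001 and wrap after the last slave.
FIRST_ADDRESS = 1001


def next_address(current: int, num_config_slaves: int) -> int:
    """Advance the scan index by one slave, wrapping back to the first address."""
    nxt = current + 1
    if nxt >= FIRST_ADDRESS + num_config_slaves:   # e.g. 1006 for 5 slaves
        nxt = FIRST_ADDRESS
    return nxt


addr = FIRST_ADDRESS
visited = []
for _ in range(7):          # seven PLC cycles, five configured slaves
    visited.append(addr)
    addr = next_address(addr, 5)
print(visited)  # [1001, 1002, 1003, 1004, 1005, 1001, 1002]
```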

In short, you do not want to change it to 1005. The code was written to work for any configuration. Hopefully this adds some clarity.

Filip_K · 06-01-2023

Hello,

Yesterday I uploaded this code. I read some of the library documentation and reviewed the code in more detail (at first I had only looked at it very roughly), and I came to exactly the same conclusions as your answer, so I left the EtherCAT address at 1001. At first I hadn't noticed that the variable is incremented further down and that the program loops through and checks all addresses, as you said.

I will keep you updated. As far as I know, the problem was still present the last time we checked. Now we will try the program Dias wrote and see if things improve ;).

Thanks for your answer and have a nice day!

AutomateSHANE · 07-26-2023

Is there any progress on this topic? Recently this problem started to occur again at random, even with EoE disabled. I see that new SAFEX firmware (1.0.1.45) became available this week, but I do not see anything in the release notes about this problem.

Filip_K · 07-27-2023

I have good news from our site: it looks like the problem is gone (at least for the moment). The machine is working properly, SafeMotion is working well, and the error no longer occurs.

A few weeks ago we were at the customer's site to do some work on the machine. The first thing we did was open the control cabinet and press in the Ethernet connectors to make sure they were seated correctly; some of them had been a bit loose. We don't know whether that was the problem or whether there were connection issues (maybe lost packets or something else), but after that, and after updating the ctrlX CORE to firmware version 1.20, the problem disappeared.

Maybe try doing the same with your machine.

That's all I can tell you. As I said, we didn't do anything magic, and the problem disappeared.

AllAutomation · 07-29-2023

Hello @AutomateSHANE,

yes, there is some progress, but unfortunately not as part of V1.0.1.45-212.

Please check future releases/patches for Bug ID #654893 ("SAFEX-C.1x: EtherCAT communication state drops to SAFE-OP at random amount of time") in the Extended Release Notes. According to our findings over the past two weeks, the behavior is triggered by EoE and CoE object-access activity. This is a Priority 1 issue on our side.

At the moment we cannot give a date for when a fix will be available.

@Dias , @ChrM , @M_Mohann, @SaDiego : FYI

Best regards

Your ctrlX SAFETY team