EventSubscriber DS crashes
|
|
---|---|
Hello, for some reason my hdb++es device server crashes once all the RAM has been used for caching by Linux. I see the same disk caching behaviour on another VM with an older Debian and Tango, but EventSubscriber works fine there, so I think it may be connected to my previous questions about Java packages, which went unanswered. Here is the topic. How can I find out why the DS is crashing if everything is OK in the Tango logs? And how can I start archiving attributes again after restarting the ES DS, since they all become "stopped attributes"? Pictures from the older VM and the newer VM are in the attachments. The ES device server was compiled from the GitHub source code. |
|
|
---|---|
Hello Diego, did you try starting your hdb++es from a shell with the -v5 option to find out where it crashes? |
|
|
---|---|
Hello Pascal. I launched the DS with the -v5 option, and it took about 4 days for the server to crash again. This is a very helpful option, but I've found only a couple of mentions of it: in the HDB++ description, in one topic on the forum and in some slides from the Tango school. Shouldn't it be described somewhere in the main documentation, e.g. in the Developer's Guide? Back to the topic: I received many "insert error" messages throughout the DS lifetime. I hadn't noticed them before because I ran the DS in the background. All details are in the attached pictures. Finally, the process was killed without any error message and, strangely, the console showed an attempt to execute the command "free -m". Yes, I did execute this command last week, but the DS crashed at 3:30 am on Sunday, so that's a mystery to me. I've relaunched the HDB++ ES server with the debug output redirected to a file, and it has collected 600 MB of data since yesterday evening. I expect the DS to be killed again on Thursday or Friday, and then I can upload the file somewhere for you to have a look. |
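One lighter-weight way to see whether the device server itself keeps growing, as opposed to Linux simply using free RAM for caching, is to sample the resident memory of the process at regular intervals. The following is only a sketch under stated assumptions: it relies on the Linux /proc filesystem, and the function names and the 60-second interval are made up for illustration.

```python
import re
import time
from pathlib import Path


def parse_vm_rss_kib(status_text):
    """Return the VmRSS value in KiB from the text of /proc/<pid>/status, or None."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, flags=re.MULTILINE)
    return int(match.group(1)) if match else None


def read_rss_kib(pid):
    """Read the current resident set size of a running process (Linux only)."""
    return parse_vm_rss_kib(Path(f"/proc/{pid}/status").read_text())


def watch_rss(pid, interval_s=60.0, samples=None):
    """Print a timestamped RSS sample every interval_s seconds.

    Stops after `samples` samples, or when the process disappears.
    """
    taken = 0
    while samples is None or taken < samples:
        try:
            rss = read_rss_kib(pid)
        except (FileNotFoundError, ProcessLookupError):
            print(time.strftime("%Y-%m-%d %H:%M:%S"), "process exited", flush=True)
            break
        print(time.strftime("%Y-%m-%d %H:%M:%S"), rss, "KiB", flush=True)
        taken += 1
        time.sleep(interval_s)
```

Calling `watch_rss(pid)` with the PID of the running device server and redirecting the output to a file gives a small log, one line per minute. A steadily rising value would point to growth inside the server process rather than in the page cache.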
|
|
---|---|
Diego, there is the notion of archiving strategies in HDB++. An archiving strategy defines the archiving contexts in which a given attribute should be archived. The HDB++ ES device interface has a memorized attribute named "Context", which holds the current archiving context. If the attributes handled by the ES have the current Context in their archiving strategy (which is a list of contexts), then the event subscriber will archive all these attributes. Since "Context" is a memorized attribute, it is set during the init phase of the ES. So if the ES is restarted, it will automatically resume archiving all the attributes that should be archived in the memorized Context. This automatic restart does not currently work if the Context attribute has never been initialized. This is a known bug we intend to fix, because attributes with a strategy set to ALWAYS should always be archived, even in this case. Simply set the Context attribute of your ES to "ALWAYS" and the automatic restart should work in your case. You can get more details about the context in this section of the documentation: https://tango-controls.readthedocs.io/en/latest/administration/services/hdbpp/hdb++-design-guidelines.html?highlight=context#hdb-tango-device-server Hoping this helps at least for this specific question…
Rosenberg's Law: Software is easy to make, except when you want it to do something new.
Corollary: The only software that's worth making is software that does something new. |
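The step described above (writing "ALWAYS" to the Context attribute of the ES) could be scripted along the following lines. This is a minimal sketch, not an official procedure: the helper name `set_context` is made up, and the proxy is assumed to be a PyTango `DeviceProxy` (or any object offering the same `write_attribute` and `read_attribute` methods).

```python
def set_context(proxy, context="ALWAYS"):
    """Write the memorized Context attribute and return the value read back.

    `proxy` is assumed to behave like a PyTango DeviceProxy connected to the
    event subscriber device.
    """
    proxy.write_attribute("Context", context)
    return proxy.read_attribute("Context").value


# Usage with PyTango (the device name below is hypothetical):
#   import tango
#   es = tango.DeviceProxy("archiving/hdbpp/eventsubscriber.1")
#   print(set_context(es))
```

Because Context is memorized, the write should only be needed once; after that the value is restored during the init phase at each restart.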
|
|
---|---|
Diego, you're right! Contributions are welcome. Contributing to the documentation is very easy: https://tango-controls.readthedocs.io/en/latest/development/contributing/documentation-guide.html
Diego, the insert errors are not normal. Can you give us more details about your event subscriber configuration? How many attributes does it handle? What are the types of these attributes?
Diego, logging the -v5 output to a file is not really a recommended way to debug a device server that crashes after several days. The log file will be huge and you might run out of disk space. The problem you are seeing is probably in the libhdbpp-mysql module and seems to correspond to a memory leak triggered when mysql_stmt_bind_param() returns an error. It looks like the prepared statements are not freed (closed) in this case, which is probably why you are seeing errors like "Can't create more than max_prepared_stmt_count statements", and some buffers might not be freed as they should be in some error cases. I created https://github.com/tango-controls-hdbpp/libhdbpp-mysql/issues/8 to track this problem. Please feel free to give more details to the libhdbpp-mysql developers there. Hoping this helps,
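To check whether prepared statements really are piling up on the database server, the number of open statements can be compared with the configured limit. The sketch below assumes a Python DB-API cursor that returns plain tuples (for example from a MySQL or MariaDB connector of your choice); the function name is made up for illustration. `Prepared_stmt_count` and `max_prepared_stmt_count` are the server's own status and system variables.

```python
def prepared_stmt_usage(cursor):
    """Return (open_statements, limit) as reported by the MySQL/MariaDB server.

    `cursor` is assumed to be a DB-API cursor whose rows are tuples of the
    form (variable_name, value).
    """
    cursor.execute("SHOW GLOBAL STATUS LIKE 'Prepared_stmt_count'")
    open_statements = int(cursor.fetchone()[1])
    cursor.execute("SHOW GLOBAL VARIABLES LIKE 'max_prepared_stmt_count'")
    limit = int(cursor.fetchone()[1])
    return open_statements, limit
```

If the first number climbs steadily towards the second while the event subscriber is running, that would be consistent with statements not being closed after a failed bind.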
|
|
---|---|
Hi Diego. I confirm there is a leak when mysql_stmt_bind_param fails. While I am fixing it, we should in any case understand why it fails. I would like to check the structure of your att_error_desc MySQL table. Can you post the result of "show create table att_error_desc;" executed from within your MySQL hdbpp database? Graziano |
|
|
---|---|
Hi, sorry for the delayed reply, I was on holiday. I've collected the logs from the DS with the -v5 option in a file (it is not so huge, about 4.31 GB); you can download it here if necessary. "show create table att_error_desc" output:
My other HDB++ DB on a different VM doesn't have such a table, but it was created about 2.5 years ago, and I think the way the whole HDB++ system works has changed significantly since then. I use MariaDB 10.3.12 rather than MySQL, because I am on Debian 9.7 now. So maybe the whole DB schema was created incorrectly and I have to recreate it, because I can't even open HDBViewer to view the data that I have already archived. |
|
|
---|---|
FYI, the memory leak issue has been fixed by Graziano and the fix is already in the master branch of the libhdbpp-mysql GitHub repository.