Suspending system crashes lbrycrd #313

Open
opened 2019-09-05 16:05:12 +02:00 by nikooo777 · 8 comments
nikooo777 commented 2019-09-05 16:05:12 +02:00 (Migrated from github.com)

On my Kubuntu 18.04 system

[niko:~/work/repositories/ansible] master(+107/-95)+ 10m13s ± uname -a
Linux nikubuntu 4.20.17-042017-generic #201903190933 SMP Tue Mar 19 13:36:11 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

I left lbrycrd running when I suspended the system to RAM. After resuming, I noticed that lbrycrd had crashed and found this in the logs:

2019-09-04T03:08:49Z UpdateTip: new best=bef7f20382b621d936d07298ceac08288b777bf5496191c31e7d57dd94dcc332 height=627855 version=0x20000000 log2_work=73.213445 tx=5695597 date='2019-09-04T03:08:25Z' progress=0.997478 cache=592.0MiB(641103txo)
2019-09-04T12:07:41Z socket receive timeout: 32204s
2019-09-04T12:07:41Z socket receive timeout: 32210s
2019-09-04T12:07:41Z socket receive timeout: 32225s
2019-09-04T12:07:41Z socket receive timeout: 32203s
2019-09-04T12:07:41Z socket receive timeout: 32201s
2019-09-04T12:07:41Z socket receive timeout: 32201s
2019-09-04T12:07:41Z socket receive timeout: 32200s
2019-09-04T12:07:41Z socket receive timeout: 32223s
2019-09-04T12:07:41Z 
************************
EXCEPTION: N5boost10wrapexceptINS_15condition_errorEEE       
boost::condition_variable::do_wait_until failed in pthread_cond_timedwait: Invalid argument       
lbrycrd in scheduler       
************************
EXCEPTION: N5boost10wrapexceptINS_15condition_errorEEE       
boost::condition_variable::do_wait_until failed in pthread_cond_timedwait: Invalid argument       
lbrycrd in scheduler       
terminate called after throwing an instance of 'boost::wrapexcept<boost::condition_error>'
  what():  boost::condition_variable::do_wait_until failed in pthread_cond_timedwait: Invalid argument
Aborted (core dumped)

The expected behavior is that lbrycrd continues operating normally after resume.

I had to reindex the whole chain to be able to start it again.

BrannonKing commented 2019-09-05 17:42:17 +02:00 (Migrated from github.com)

Was this version 17.2.1? And was it an official build or a custom one?

bvbfan commented 2019-09-05 17:51:36 +02:00 (Migrated from github.com)

Are you using a wireless or a wired connection? On suspend the network is suspended as well, and after wake-up it can take a while for the connection to come back (a VPN can take especially long). If the network takes longer than the time we allow in the wait, it will throw. Either way, we should flush data on an exception as well; otherwise the data ends up corrupt.

BrannonKing commented 2019-09-05 22:19:47 +02:00 (Migrated from github.com)

We cannot flush the disk buffers whenever an arbitrary exception kills the process; the exception may have come from the disk flush itself. However, if there is a specific exception that we know doesn't affect the data on disk, we could catch that one and either restart that component or shut down cleanly.
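
A minimal sketch of that idea, not lbrycrd's actual code: a wrapper catches one specific exception type and turns it into a shutdown request, while everything else propagates untouched. `RunGuarded` and `g_shutdown_requested` are hypothetical names, and `std::system_error` carrying `EINVAL` is only a stand-in for the `boost::condition_error` seen in the log.

```cpp
#include <atomic>
#include <cassert>
#include <cerrno>
#include <functional>
#include <stdexcept>
#include <system_error>

// Hypothetical flag: stands in for whatever mechanism the node uses
// to request an orderly shutdown.
std::atomic<bool> g_shutdown_requested{false};

// Runs `body`. If it throws a system_error whose code is "invalid argument"
// (the stand-in for the timed-wait failure), request a clean shutdown and
// return false. Any other exception is NOT handled here and propagates,
// because it might come from the disk layer itself.
bool RunGuarded(const std::function<void()>& body)
{
    try {
        body();
        return true;
    } catch (const std::system_error& e) {
        if (e.code() != std::errc::invalid_argument) {
            throw; // unknown cause: do not touch on-disk state
        }
        g_shutdown_requested.store(true);
        return false;
    }
}
```

The point of narrowing the handler is that a flush is attempted only on a path where the on-disk state is assumed to be intact; whether that assumption holds for the scheduler failure in this issue would still need to be confirmed.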

BrannonKing commented 2019-09-06 23:22:19 +02:00 (Migrated from github.com)

I'm unable to reproduce this with `kill -STOP/-CONT`, and I'm unable to reproduce it with a few random suspensions during sync. I like @bvbfan's theory about the slow network startup time. @nikooo777, if this is easily reproducible for you, I have some things we can try. We can try builds with a few different versions of boost compiled in. We can also run a debug build and get the core dump, so that we know the full stack for the error.

nikooo777 commented 2019-09-17 14:37:44 +02:00 (Migrated from github.com)

Sorry for the delayed answer. My PC is wired, so I am unsure why this would have happened. It was also the first time I've seen this.
Is this related? It looks like it: https://github.com/bitcoin/bitcoin/issues/14200

bvbfan commented 2019-09-19 13:29:12 +02:00 (Migrated from github.com)

I've tested it at least 3 times and can't reproduce it with my settings:
WiFi with auto connect (priority 0), no VPN or proxy, stored password, no explicit user input.

nikooo777 commented 2019-11-25 12:30:29 +01:00 (Migrated from github.com)

This happened to me again a couple of weeks ago, but it's rather sporadic and probably not worth investigating further, since a node is unlikely to be suspended every day.
The issue can be closed if you agree.

BrannonKing commented 2020-09-07 15:36:20 +02:00 (Migrated from github.com)

I have a theory that this is fixed here: https://github.com/bitcoin/bitcoin/pull/18284 . I'm going to bring it into the v19 build.
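
For context on the error text in the log: POSIX specifies that `pthread_cond_timedwait` fails with `EINVAL` ("Invalid argument") when the deadline's nanosecond field is negative or at least one billion. The sketch below shows what validating and normalizing such a deadline could look like. It assumes the crash comes from an out-of-range deadline, which this thread does not confirm, and `IsValidDeadline` and `NormalizeDeadline` are hypothetical helpers, not functions from lbrycrd, Bitcoin Core, or Boost.

```cpp
#include <cassert>
#include <time.h>

constexpr long kNanosPerSecond = 1000000000L;

// POSIX timed waits require 0 <= tv_nsec < 1e9; anything else is EINVAL.
bool IsValidDeadline(const timespec& ts)
{
    return ts.tv_nsec >= 0 && ts.tv_nsec < kNanosPerSecond;
}

// Carries out-of-range nanoseconds into the seconds field so that the
// result denotes the same instant with tv_nsec back in [0, 1e9).
timespec NormalizeDeadline(timespec ts)
{
    ts.tv_sec += ts.tv_nsec / kNanosPerSecond;
    ts.tv_nsec %= kNanosPerSecond;
    if (ts.tv_nsec < 0) {
        // C++ remainder keeps the sign of the dividend, so borrow a second.
        ts.tv_nsec += kNanosPerSecond;
        ts.tv_sec -= 1;
    }
    return ts;
}
```

A caller would check `IsValidDeadline` (or always normalize) before handing the deadline to a timed wait. Whether the linked pull request takes this approach is not stated in this thread.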

Reference: LBRYCommunity/lbrycrd#313