SRS crash, edge state abnormal, terminate called after throwing an instance of 'std::bad_alloc' #509

Closed
han4235 opened this issue Oct 22, 2015 · 16 comments
Labels
Bug

Comments

@han4235 han4235 commented Oct 22, 2015

SRS exits on its own while pulling a stream:
srs: src/app/srs_app_edge.cpp:766: virtual int SrsPlayEdge::on_ingest_play(): Assertion `state == SrsEdgeStatePlay' failed.

@jarod jarod commented Oct 22, 2015

I ran into this too, on version 2.0a2.

Member

@winlinvip winlinvip commented Oct 22, 2015

Please provide your configuration, logs, version, and steps to reproduce. Thanks!

@winlinvip winlinvip added the Bug label Oct 22, 2015
@winlinvip winlinvip added this to the srs 2.0 release milestone Oct 22, 2015

@jarod jarod commented Oct 23, 2015

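My configuration is very simple: the bundled origin.conf and edge.conf, with the origin in edge.conf changed to my own server's domain name. There is 1 origin and 2 edges, with about 5 publishers and about 50 players. Both the origin and the edges have crashed.
The edge log is the same as the one reported above. The origin log is: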

terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc

I can provide the core dump files if needed.

Member

@winlinvip winlinvip commented Oct 23, 2015

Please send me the core and the matching srs binary. Putting them on a file-sharing service is fine too. Is this CentOS?

@jarod jarod commented Oct 23, 2015

CentOS 7, 64-bit. The related files are at http://pan.baidu.com/s/1pJGLnyN

Member

@winlinvip winlinvip commented Oct 23, 2015

OK, I will find time to take a look.

@winlinvip winlinvip changed the title from "SRS exits on its own" to "SRS crash, edge state abnormal" Dec 22, 2015
Member

@winlinvip winlinvip commented Dec 22, 2015

[winlin@centos7 srs]$ ./objs/srs -v
2.0.195
[winlin@centos7 srs]$ ls -lh core.*
-rw-------. 1 winlin winlin 1.1G Oct 22 21:10 core.13964
-rw-------. 1 winlin winlin 2.1G Oct 22 22:16 core.31521

(gdb) f 2
#2  0x00000000004f884b in SrsEdgeIngester::cycle (this=0x455bb50) at src/app/srs_app_edge.cpp:138
warning: Source file is more recent than executable.
138     if ((ret = client->handshake()) != ERROR_SUCCESS) {
(gdb) p this[0]
$3 = {<ISrsReusableThread2Handler> = {_vptr.ISrsReusableThread2Handler = 0x898e50 <vtable for SrsEdgeIngester+16>}, stream_id = 1, _source = 
    0x2327650, _edge = 0x2193580, _req = 0x23982a0, pthread = 0x34d09c0, stfd = 0x16ee220, io = 0x3b63e20, kbps = 0x3b5e190, client = 0x34cd3a0, 
  origin_index = 0}

So the edge object (the SrsEdgeIngester) has not been corrupted.

(gdb) f 0
#0  0x00000000004500d0 in SrsComplexHandshake::handshake_with_server (this=0x7f5ff5a5cc00, hs_bytes=0x4341d00, io=0x3b63e20)
    at src/protocol/srs_rtmp_handshake.cpp:1341
1341        if ((ret = hs_bytes->read_s0s1s2(io)) != ERROR_SUCCESS) {

(gdb) p hs_bytes[0]
$7 = {_vptr.SrsHandshakeBytes = 0x191f860, c0c1 = 0x7f603a4727d8 <main_arena+120> "\310'G:`\177", s0s1s2 = 0x0, c2 = 0x0}

So this object has already been freed, and using it again is bound to cause problems.

Member

@winlinvip winlinvip commented Dec 22, 2015

(gdb) bt
#0  0x00000000004500d0 in SrsComplexHandshake::handshake_with_server (this=0x7f5ff5a5cc00, hs_bytes=0x4341d00, io=0x3b63e20)
    at src/protocol/srs_rtmp_handshake.cpp:1341
#1  0x0000000000433889 in SrsRtmpClient::handshake (this=0x34cd3a0) at src/protocol/srs_rtmp_stack.cpp:1978
#2  0x00000000004f884b in SrsEdgeIngester::cycle (this=0x455bb50) at src/app/srs_app_edge.cpp:138
#3  0x00000000004a355d in SrsReusableThread2::cycle (this=0x34d09c0) at src/app/srs_app_thread.cpp:533
#4  0x00000000004a2557 in internal::SrsThread::thread_cycle (this=0x1b5b710) at src/app/srs_app_thread.cpp:203
#5  0x00000000004a2769 in internal::SrsThread::thread_fun (arg=0x1b5b710) at src/app/srs_app_thread.cpp:244
#6  0x000000000051643e in _st_thread_main () at sched.c:327
#7  0x0000000000516bae in st_thread_create (start=0x12f5105, arg=0xfbad8001, joinable=32608, stk_size=974285335) at sched.c:591
#8  0x0000000000000000 in ?? ()
(gdb) 

That is the backtrace.

Member

@winlinvip winlinvip commented Dec 22, 2015

(gdb) p hs_bytes[0]
$4 = {_vptr.SrsHandshakeBytes = 0x191f860, c0c1 = 0x7f603a4727d8 <main_arena+120> "\310'G:`\177", s0s1s2 = 0x0, c2 = 0x0}

This suggests that c0c1 was completed but s0s1s2 had not yet been received. That should be an impossible execution path:

    // s0s1s2
    if ((ret = hs_bytes->read_s0s1s2(io)) != ERROR_SUCCESS) {
        return ret;
    }

    // plain text required.
    if (hs_bytes->s0s1s2[0] != 0x03) {
        ret = ERROR_RTMP_HANDSHAKE;
        srs_warn("handshake failed, plain text required. ret=%d", ret);
        return ret;
    }

int SrsHandshakeBytes::read_s0s1s2(ISrsProtocolReaderWriter* io)
{
    int ret = ERROR_SUCCESS;

    if (s0s1s2) {
        return ret;
    }

    ssize_t nsize;

    s0s1s2 = new char[3073];
    if ((ret = io->read_fully(s0s1s2, 3073, &nsize)) != ERROR_SUCCESS) {
        srs_warn("read s0s1s2 failed. ret=%d", ret);
        return ret;
    }
    srs_verbose("read s0s1s2 success.");

    return ret;
}

By the time SrsHandshakeBytes::read_s0s1s2 returns, s0s1s2 is guaranteed to be non-NULL.

Member

@winlinvip winlinvip commented Dec 22, 2015

Looking at hs_bytes again:

(gdb) p hs_bytes[0]
$5 = {_vptr.SrsHandshakeBytes = 0x191f860, c0c1 = 0x7f603a4727d8 <main_arena+120> "\310'G:`\177", s0s1s2 = 0x0, c2 = 0x0}
(gdb) x /12xb hs_bytes->c0c1
0x7f603a4727d8 <main_arena+120>:    0xc8    0x27    0x47    0x3a    0x60    0x7f    0x00    0x00
0x7f603a4727e0 <main_arena+128>:    0xc8    0x27    0x47    0x3a

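Here c0 should be 0x03, but it is actually 0xc8.
The c0c1 pointer is 0x7f603a4727d8, which looks like a stack pointer (gdb labels it main_arena+120), whereas it should point to a heap buffer.
Taken together, these two observations indicate that hs_bytes is a wild pointer.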
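
The two checks above can be reproduced from the dumped bytes alone. The following is a minimal sketch, with the byte values copied from the gdb output and helper names (`load_le64`, `looks_like_c0`) that are illustrative rather than part of SRS. It assumes x86-64 little-endian byte order. Reassembled, the first eight bytes read as an address 0x10 below the c0c1 pointer itself, so the buffer holds address-like data rather than a handshake packet.

```cpp
#include <cstddef>
#include <cstdint>

// First 8 bytes gdb printed at hs_bytes->c0c1 (copied from the dump above).
static const unsigned char kDump[8] = {0xc8, 0x27, 0x47, 0x3a, 0x60, 0x7f, 0x00, 0x00};

// Reassemble 8 little-endian bytes into a 64-bit value (x86-64 byte order).
inline std::uint64_t load_le64(const unsigned char* p) {
    std::uint64_t v = 0;
    for (int i = 7; i >= 0; --i) {
        v = (v << 8) | p[i];
    }
    return v;
}

// The thread states that a valid c0 byte is 0x03.
inline bool looks_like_c0(const unsigned char* p) {
    return p[0] == 0x03;
}
```

With the bytes above, `load_le64(kDump)` yields 0x00007f603a4727c8 and `looks_like_c0(kDump)` is false.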

Member

@winlinvip winlinvip commented Dec 22, 2015

Looking at the stack again:

(gdb) f 1
#1  0x0000000000433889 in SrsRtmpClient::handshake (this=0x34cd3a0) at src/protocol/srs_rtmp_stack.cpp:1978
1978        if ((ret = complex_hs.handshake_with_server(hs_bytes, io)) != ERROR_SUCCESS) {
(gdb) p hs_bytes[0]
$9 = {_vptr.SrsHandshakeBytes = 0x8917f0 <vtable for SrsHandshakeBytes+16>, c0c1 = 0x3ce7ab0 "\003V(\340D\200", 
  s0s1s2 = 0x4080ec0 "\003V(\340B\001", c2 = 0x0}

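The hs_bytes seen here differs from the one seen earlier, which indicates that the problem occurred inside complex_hs.handshake_with_server. Here in frame 1, c0c1 is a heap pointer and its data starts with 03, so it has not been corrupted.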

Member

@winlinvip winlinvip commented Dec 22, 2015

(gdb) p ((SrsStSocket*)io)[0]
$15 = {<ISrsProtocolReaderWriter> = {<ISrsProtocolReader> = {<ISrsBufferReader> = {
        _vptr.ISrsBufferReader = 0x895e20 <vtable for SrsStSocket+96>}, <ISrsProtocolStatistic> = {
        _vptr.ISrsProtocolStatistic = 0x895eb0 <vtable for SrsStSocket+240>}, <No data fields>}, <ISrsProtocolWriter> = {<ISrsBufferWriter> = {
        _vptr.ISrsBufferWriter = 0x895f18 <vtable for SrsStSocket+344>}, <No data fields>}, <No data fields>}, recv_timeout = 30000000, 
  send_timeout = 30000000, recv_bytes = 3073, send_bytes = 1537, stfd = 0x16ee220}

Judging from the io counters, 3073 bytes were received (s0s1s2) and 1537 bytes were sent (c0c1), so the problem probably occurred while processing s0s1s2.
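
The byte counts follow from the RTMP handshake layout: a 1-byte version (c0 or s0) plus 1536-byte blocks (c1, s1, s2, c2). A small sketch of that arithmetic is below. The constant names and the `classify` helper are hypothetical, introduced only to show how the counters map to a handshake step.

```cpp
#include <cstddef>

// RTMP handshake sizes: a 1-byte version plus 1536-byte blocks.
constexpr std::size_t kVersionSize = 1;
constexpr std::size_t kBlockSize = 1536;

constexpr std::size_t kC0C1Size = kVersionSize + kBlockSize;        // c0 + c1
constexpr std::size_t kS0S1S2Size = kVersionSize + 2 * kBlockSize;  // s0 + s1 + s2
constexpr std::size_t kC2Size = kBlockSize;                         // c2

static_assert(kC0C1Size == 1537, "matches send_bytes in the gdb output");
static_assert(kS0S1S2Size == 3073, "matches recv_bytes in the gdb output");

// Hypothetical helper: which client-side step do the byte counters imply?
enum class HandshakeStep { None, SentC0C1, ReceivedS0S1S2, SentC2 };

constexpr HandshakeStep classify(std::size_t sent, std::size_t received) {
    return (sent == kC0C1Size + kC2Size && received == kS0S1S2Size) ? HandshakeStep::SentC2
         : (sent == kC0C1Size && received == kS0S1S2Size) ? HandshakeStep::ReceivedS0S1S2
         : (sent == kC0C1Size && received == 0) ? HandshakeStep::SentC0C1
         : HandshakeStep::None;
}
```

For the counters in the dump, `classify(1537, 3073)` gives `ReceivedS0S1S2`: s0s1s2 was read in full and c2 had not been sent yet.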

Member

@winlinvip winlinvip commented Dec 22, 2015

This may be caused by allocating the object on the stack. Let's change it to a heap allocation.
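
The thread does not show the diff of the fix, so the following is only a sketch of the general shape of such a change, not the actual SRS patch. `HandshakeHelper` and both functions are hypothetical stand-ins. The sketch assumes the concern is keeping a sizeable helper object out of the stack frame of the code that runs the handshake.

```cpp
#include <cstddef>
#include <memory>

// Hypothetical stand-in for a handshake helper; this is not the SRS class.
struct HandshakeHelper {
    char scratch[4096];  // illustrative working buffer
    int run() {
        scratch[0] = 0x03;
        return 0;
    }
};

// Before: the helper lives in the caller's stack frame.
inline int handshake_on_stack() {
    HandshakeHelper hs;
    return hs.run();
}

// After: the helper lives on the heap; unique_ptr releases it on return.
inline int handshake_on_heap() {
    std::unique_ptr<HandshakeHelper> hs(new HandshakeHelper());
    return hs->run();
}
```

Both variants behave the same for the caller. Only the storage location of the helper changes.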

@winlinvip winlinvip closed this in 5d3a183 Dec 22, 2015
@winlinvip winlinvip changed the title from "SRS crash, edge state abnormal" to "SRS crash, edge state abnormal, terminate called after throwing an instance of 'std::bad_alloc'" Oct 26, 2020
Member

@winlinvip winlinvip commented Oct 26, 2020

https://stackoverflow.com/a/2504601
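bad_alloc basically means an allocation could not be satisfied. Judging by the size of the core files, this was a service that had been running for a fairly long time.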

If you are running on a typical embedded processor running Linux without virtual memory it is quite likely 
your process will be terminated by the operating system before new fails if you allocate too much memory.

If you are running your program on a machine with less physical memory than the maximum of virtual 
memory (2 GB on standard Windows) you will find that once you have allocated an amount of memory 
approximately equal to the available physical memory, further allocations will succeed but will cause 
paging to disk. This will bog your program down and you might not actually be able to get to the point 
of exhausting virtual memory. So you might not get an exception thrown.

If you have more physical memory than the virtual memory, and you simply keep allocating memory, 
you will get an exception when you have exhausted virtual memory to the point where you can not 
allocate the block size you are requesting.

If you have a long-running program that allocates and frees in many different block sizes, including 
small blocks, with a wide variety of lifetimes, the virtual memory may become fragmented to the point 
where new will be unable to find a large enough block to satisfy a request. Then new will throw an 
exception. If you happen to have a memory leak that leaks the occasional small block in a random 
location that will eventually fragment memory to the point where an arbitrarily small block allocation 
will fail, and an exception will be thrown.

If you have a program error that accidentally passes a huge array size to new[], new will fail and throw 
an exception. This can happen for example if the array size is actually some sort of random byte pattern, 
perhaps derived from uninitialized memory or a corrupted communication stream.
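
The last case in the quote, a garbage array size reaching new[], can be illustrated with a short sketch. The function names and the idea of a length field read from the wire are hypothetical. The point is only that a negative signed length converts to an enormous unsigned size, and that a bounds check before new[] avoids the failing allocation.

```cpp
#include <cstddef>
#include <cstdint>
#include <new>

// A negative length (for example from a corrupted stream) becomes enormous
// once converted to the unsigned size type that new[] takes.
inline std::size_t as_alloc_size(std::int32_t length_from_wire) {
    return static_cast<std::size_t>(length_from_wire);
}

// Guarded allocation: reject implausible sizes before calling new[].
// Returns nullptr when the length is negative or above max_size.
inline char* alloc_payload(std::int32_t length_from_wire, std::size_t max_size) {
    if (length_from_wire < 0 || static_cast<std::size_t>(length_from_wire) > max_size) {
        return nullptr;
    }
    return new char[static_cast<std::size_t>(length_from_wire)];
}
```

The caller owns the returned buffer and releases it with `delete[]`.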

@winlinvip winlinvip commented Oct 26, 2020

This paper explains that bad_alloc does not always mean OOM: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p1404r1.html

I wrote an example, as follows:

/*
ulimit -S -v 204800
g++ -g -O0 t.cpp -o t && ./t
*/
#include <stdio.h>
int main(){
    char* p1 = new char[193000 * 1024]; // huge allocation
    char* p0 = new char[100 * 1024]; // small allocation
    printf("OK\n");
}

Running it crashes:

[root@SRS tmp]# ulimit -S -v 204800
[root@SRS tmp]# g++ -g -O0 t.cpp -o t && ./t
terminate called after throwing an instance of 'St9bad_alloc'
  what():  std::bad_alloc
Aborted (core dumped)

[root@SRS tmp]# ll core.21082 
-rw------- 1 root root 198045696 Oct 26 21:04 core.21082

The backtrace shows that the failure is not at the large allocation but at the small one:

[root@SRS tmp]# gdb t -c core.21082 
GNU gdb (GDB) Red Hat Enterprise Linux (7.2-92.el6)
Copyright (C) 2010 Free Software Foundation, Inc.

warning: no loadable sections found in added symbol-file system-supplied DSO at 0x7ffeae793000
Core was generated by `./t'.
Program terminated with signal 6, Aborted.
#0  0x00007fd17ff0e4f5 in raise () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.212.el6_10.3.x86_64 libgcc-4.4.7-23.el6.x86_64 libstdc++-4.4.7-23.el6.x86_64
(gdb) bt
#0  0x00007fd17ff0e4f5 in raise () from /lib64/libc.so.6
#1  0x00007fd17ff0fcd5 in abort () from /lib64/libc.so.6
#2  0x00007fd1807c8a8d in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib64/libstdc++.so.6
#3  0x00007fd1807c6be6 in ?? () from /usr/lib64/libstdc++.so.6
#4  0x00007fd1807c6c13 in std::terminate() () from /usr/lib64/libstdc++.so.6
#5  0x00007fd1807c6d32 in __cxa_throw () from /usr/lib64/libstdc++.so.6
#6  0x00007fd1807c712d in operator new(unsigned long) () from /usr/lib64/libstdc++.so.6
#7  0x00007fd1807c71e9 in operator new[](unsigned long) () from /usr/lib64/libstdc++.so.6
#8  0x0000000000400624 in main () at t.cpp:8
(gdb) f 8
#8  0x0000000000400624 in main () at t.cpp:8
8	    char* p0 = new char[100 * 1024]; // small allocation
(gdb) 
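
In t.cpp the large allocation succeeds and uses most of the 204800 KB limit, so the later small request is the one that throws. The crash site therefore shows where the budget ran out, not what consumed it. One way to make that visible in logs is to report the failing request instead of letting the exception reach std::terminate(). The sketch below uses a hypothetical helper name, `alloc_or_report`.

```cpp
#include <cstddef>
#include <cstdio>
#include <new>

// Allocate n bytes. On failure, log the requested size and a caller-supplied
// label instead of letting std::bad_alloc propagate to std::terminate().
inline char* alloc_or_report(std::size_t n, const char* label) {
    try {
        return new char[n];
    } catch (const std::bad_alloc& e) {
        std::fprintf(stderr, "alloc failed: %s, %zu bytes: %s\n", label, n, e.what());
        return nullptr;
    }
}
```

Replacing the two `new char[...]` expressions in t.cpp with calls to this helper would log the small request as the failing one while the process keeps running. The caller still has to check for nullptr and free the buffer with `delete[]`.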

@winlinvip winlinvip commented Oct 31, 2020

I added a gdb script to count the coroutines in the core. First download the script, srs.py:

(gdb) source gdb/srs.py 
(gdb) nn_coroutines 
this coroutine(&_st_this_thread->tlink) is: 0x7f43ba761e78
next is 0x7f43b92d9e78, total 500
next is 0x7f43b5c37e78, total 1000
next is 0x7f43bfd71e78, total 31500
next is 0x7f43bdad9e78, total 32000
next is 0x7f43bd8f3e78, total 32500
total coroutines: 32717
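
The output suggests the script starts at the current coroutine's link and follows `next` pointers until it returns to the start. The script itself is not shown here, so the following is only a sketch of that traversal over a circular doubly linked list. The `Link` type and function names are hypothetical, and whether the real script excludes a list-head sentinel from the total is not shown in the thread.

```cpp
#include <cstddef>

// Hypothetical intrusive link of a circular doubly linked list.
struct Link {
    Link* next;
    Link* prev;
};

// Make a single link into a ring of one.
inline void ring_init(Link* l) {
    l->next = l;
    l->prev = l;
}

// Insert node immediately before pos in the ring.
inline void ring_insert_before(Link* node, Link* pos) {
    node->next = pos;
    node->prev = pos->prev;
    pos->prev->next = node;
    pos->prev = node;
}

// Count every link in the ring, including the starting one.
inline std::size_t ring_count(const Link* start) {
    std::size_t total = 1;
    for (const Link* p = start->next; p != start; p = p->next) {
        ++total;
    }
    return total;
}
```

The count is the same regardless of which link the walk starts from.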

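By default, ST allocates coroutine stacks with mmap, so creation fails once the number of coroutines passes a certain limit. That limit can be checked with: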

[root@05ff04a933cd st]# sysctl vm.max_map_count
vm.max_map_count = 65530

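Note: inside Docker this limit does not take effect. Up to 650162 coroutines can be created there, using about 40GB of memory. Production machines generally have this limit in effect.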

Then compile this code, huge-threads.cpp, and run it:

g++ huge-threads.cpp ../../objs/st/libst.a -g -O0 -o huge-threads && 
./huge-threads 60000

It generally fails at around 30,000 coroutines:

[root@05ff04a933cd st]# ./huge-threads 60000
pid=77682, create 60000 coroutines
create thread fail, i=32749
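
The failure points in this thread (32717 coroutines in the core, 32749 in the test) sit just under half of vm.max_map_count (65530 / 2 = 32765). That would be consistent with each coroutine stack costing about two mappings, but the factor of two is an inference from these numbers and is not confirmed in the thread. A small sketch of the estimate follows. Both function names are hypothetical, and the procfs path is the standard location of this sysctl on Linux.

```cpp
#include <fstream>
#include <string>

// Read vm.max_map_count from procfs. Returns -1 if the file is unavailable
// (for example on non-Linux systems).
inline long read_max_map_count(const std::string& path = "/proc/sys/vm/max_map_count") {
    std::ifstream in(path.c_str());
    long value = -1;
    if (!(in >> value)) {
        return -1;
    }
    return value;
}

// Rough ceiling on coroutines if each stack costs maps_per_stack mappings.
// maps_per_stack is an assumption supplied by the caller.
inline long estimate_max_coroutines(long max_map_count, long maps_per_stack) {
    if (max_map_count <= 0 || maps_per_stack <= 0) {
        return -1;
    }
    return max_map_count / maps_per_stack;
}
```

The real ceiling is somewhat lower than the estimate because the process's own mappings (binary, libraries, heap) also count against the limit.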

There are two ways to address this:

  1. Investigate why so many coroutines exist when Sources are not cleaned up.
  2. Enable MALLOC_STACK at compile time.
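
Option 2 implies that MALLOC_STACK switches stack allocation from mmap to malloc. The sketch below contrasts the two allocation paths under that assumption. `StackMem`, `stack_alloc`, and `stack_free` are hypothetical names, not ST's API, and the code is Linux-specific. Each successful mmap call creates a mapping that counts toward vm.max_map_count. Whether malloc avoids a new mapping depends on the allocator and the block size, so this is an illustration rather than a guarantee.

```cpp
#include <cstddef>
#include <cstdlib>
#include <sys/mman.h>

// Hypothetical record of one stack region and how it was obtained.
struct StackMem {
    void* base;
    std::size_t size;
    bool from_malloc;
};

// Allocate a stack region either with malloc or with an anonymous mmap.
// base is nullptr on failure.
inline StackMem stack_alloc(std::size_t size, bool use_malloc) {
    StackMem s = {nullptr, size, use_malloc};
    if (use_malloc) {
        s.base = std::malloc(size);
    } else {
        void* p = mmap(nullptr, size, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        s.base = (p == MAP_FAILED) ? nullptr : p;
    }
    return s;
}

// Release a region with the call that matches how it was allocated.
inline void stack_free(StackMem& s) {
    if (!s.base) {
        return;
    }
    if (s.from_malloc) {
        std::free(s.base);
    } else {
        munmap(s.base, s.size);
    }
    s.base = nullptr;
}
```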