## UCloud Cloud Host

https://console.ucloud.cn/

Account password: FS12345678

## Environment Setup

**Postgres**

```sh
apt update
apt install postgresql postgresql-contrib

# switch to the postgres user and set a password inside psql
su postgres
psql
# alter user postgres with password 'ROOT';

# allow the subnet to connect with md5 auth
vi /etc/postgresql/9.5/main/pg_hba.conf
# add: host all all 10.60.178.0/24 md5
service postgresql restart

# create the database and restore the dump
createdb iOTA_console
psql -d iOTA_console < dump.sql
```

**Docker**

```sh
curl -sSL https://get.daocloud.io/docker | sh
```

**Redis**

Exposing Redis's default port to the public network is unsafe, so enable Ubuntu's firewall first:

```sh
ufw enable
ufw status

# allow inbound traffic by default
ufw default allow
# deny external access to port 6379
ufw deny 6379

# other examples:
# allow 10.0.1.0/10 to reach port 7277 on 10.8.30.117
ufw allow proto tcp from 10.0.1.0/10 to 10.8.30.117 port 7277

# ufw status now shows:
Status: active

To                         Action      From
--                         ------      ----
6379                       DENY        Anywhere
6379 (v6)                  DENY        Anywhere (v6)
```

Even with ufw configured, the opened ports were still unreachable from the public network. This is also controlled in the UCloud console:

UNet (basic network) > External Firewall > Create Firewall (custom rules): open all TCP ports and block only redis 6379.

![image-20211122152046659](imgs/UCloud-DAC上云测试/image-20211122152046659.png)

UHost (cloud host) > Related Resource Actions > Change External Firewall

![image-20211122152136855](imgs/UCloud-DAC上云测试/image-20211122152136855.png)

Install Redis:

```sh
apt update
apt install redis-server
```

## Traffic Mirroring Test

The data center is being relocated, and the plan is to run a single DAC instance in the cloud for data acquisition. As preparation, run a live traffic-mirroring test that does not affect the production DAC's collection:

1. Passive connections on the proxy are forwarded to UCloud.
   1. The stream is copied one way only: the device -> proxy -> DAC path is kept, while the DAC -> proxy -> device path is cut (DAC->proxy-|->device).
2. Active connections:
   1. for MQTT/HTTP connections that actively connect to third-party servers,
   2. append a suffix to the MQTT client ID.
3. Cut off the driver's writes.

Key code (a usage sketch follows the block):

```go
// Excerpt from the proxy package: log is the project's internal logger
// and OutTarget is a package-level config variable.
import (
	"fmt"
	"io"
	"net"
	"sync"
	"time"
)

// io.Copy cannot be run more than once on the same stream, so when
// OutTarget is configured the local copy and the outward mirror copy
// are done in a single pass via TeeReader/TeeWriter.
func Pipeout(conn1, conn2 net.Conn, port string, wg *sync.WaitGroup, reg []byte) {
	if OutTarget != "" {
		tt := fmt.Sprintf("%s:%s", OutTarget, port)
		tw := NewTeeWriter(tt, reg)
		tw.Start()
		// TeeReader reads from conn2 and writes a copy to conn1;
		// io.Copy drains that same stream into the mirror writer.
		if _, err := io.Copy(tw, io.TeeReader(conn2 /*read*/, conn1 /*write*/)); err != nil {
			log.Error("pipeout error: %v", err)
		}
		tw.Close()
	} else {
		io.Copy(conn1, conn2)
	}
	conn1.Close()
	log.Info("[tcp] close the connect at local:%s and remote:%s", conn1.LocalAddr().String(), conn1.RemoteAddr().String())
	wg.Done()
}

// TeeWriter mirrors the traffic stream to an external target.
type TeeWriter struct {
	target    string           // forwarding target address
	conn      net.Conn         // forwarding connection
	isConnect bool             // whether the connection is up
	exitCh    chan interface{} // exit signal
	registry  []byte           // registration packet replayed after each (re)connect
}

func NewTeeWriter(target string, reg []byte) *TeeWriter {
	return &TeeWriter{
		target:   target,
		exitCh:   make(chan interface{}),
		registry: reg,
	}
}

func (w *TeeWriter) Start() error {
	go w.keep_connect()
	return nil
}

func (w *TeeWriter) Close() error {
	close(w.exitCh)
	return nil
}

// Write is best-effort: mirroring must never break the primary copy,
// so this method never returns an error. The actual write happens on a
// separate goroutine, so ordering is not guaranteed.
func (w *TeeWriter) Write(p []byte) (n int, err error) {
	defer func() {
		if err := recover(); err != nil {
			log.Error("teewrite failed %s", w.target)
		}
	}()
	if w.isConnect {
		go w.conn.Write(p)
	}
	return len(p), nil
}

// keep_connect dials the target in a loop, replays the registration
// packet, enables TCP keep-alive, and watches for remote disconnects.
func (w *TeeWriter) keep_connect() {
	defer func() {
		if err := recover(); err != nil {
			log.Error("teewrite keep connect error: %v", err)
		}
	}()
	for {
		if cont := func() bool {
			var err error
			w.conn, err = net.Dial("tcp", w.target)
			if err != nil {
				select {
				case <-time.After(time.Second):
					return true // retry
				case <-w.exitCh:
					return false
				}
			}
			w.isConnect = true
			defer func() { w.isConnect = false }()
			defer w.conn.Close()
			if w.registry != nil {
				if _, err := w.conn.Write(w.registry); err != nil {
					return true
				}
			}
			if err := w.conn.(*net.TCPConn).SetKeepAlive(true); err != nil {
				return true
			}
			if err := w.conn.(*net.TCPConn).SetKeepAlivePeriod(30 * time.Second); err != nil {
				return true
			}
			connLostCh := make(chan interface{})
			defer close(connLostCh)
			// watch the remote (bconn) side of the mirror connection
			go func() {
				defer func() {
					log.Info("bconn check exit")
					recover() // write to closed channel
				}()
				one := make([]byte, 1)
				for {
					if _, err := w.conn.Read(one); err != nil {
						log.Info("bconn disconnected")
						connLostCh <- err
						return
					}
					time.Sleep(time.Second)
				}
			}()
			select {
			case <-connLostCh:
				time.Sleep(10 * time.Second)
				return true // reconnect
			case <-w.exitCh:
				return false
			}
		}(); !cont {
			break
		} else {
			time.Sleep(time.Second)
		}
	}
}
```
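For context, a minimal sketch of how `Pipeout` could be driven from the proxy's accept loop. This is illustrative only — `serve`, `backendAddr`, `mirrorPort`, and the reverse-direction copy are assumptions, not the actual proxy source:

```go
// Hypothetical wiring: accept device connections, dial DAC, mirror the
// device -> DAC direction through Pipeout (which tees to OutTarget),
// and keep the DAC -> device direction purely local.
func serve(listenAddr, backendAddr, mirrorPort string) error {
	ln, err := net.Listen("tcp", listenAddr)
	if err != nil {
		return err
	}
	for {
		dev, err := ln.Accept() // inbound connection from a device
		if err != nil {
			return err
		}
		go func(dev net.Conn) {
			dac, err := net.Dial("tcp", backendAddr) // outbound to DAC
			if err != nil {
				dev.Close()
				return
			}
			var wg sync.WaitGroup
			wg.Add(1)
			// device -> DAC, mirrored to the cloud instance
			go Pipeout(dac, dev, mirrorPort, &wg, nil)
			// DAC -> device stays local; this is the leg that gets cut
			// for the one-way copy described above
			go func() {
				io.Copy(dev, dac)
				dev.Close()
			}()
			wg.Wait()
		}(dev)
	}
}
```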
The traffic-mirroring test was never actually executed…

## DAC Online Test

Configuration:

```json
```

`url.maps.json` needs to be configured:

```json
"47.106.112.113:1883"
"47.104.249.223:1883"
"mqtt.starwsn.com:1883"
"test.tdzntech.com:1883"
"mqtt.tdzntech.com:1883"
"s1.cn.mqtt.theiota.cn:8883"
"mqtt.datahub.anxinyun.cn:1883"
"218.3.126.49:3883"
"221.230.55.28:1883"
"anxin-m1:1883"
"10.8.25.201:8883"
"10.8.25.231:1883"
"iota-m1:1883"
```

The following data could not be retrieved:

1. GNSS data

   http.get error: Get "http://10.8.25.254:7005/gnss/6542/data?startTime=1575443410000&endTime=1637628026000": dial tcp 10.8.25.254:7005: i/o timeout

2. 时

## DAC Memory Investigation

> These notes are not organized very clearly; see https://www.cnblogs.com/gao88/p/9849819.html for reference.
>
> Using pprof:
>
> https://segmentfault.com/a/1190000020964967
>
> https://cizixs.com/2017/09/11/profiling-golang-program/

Check the process's memory consumption:

```sh
top -c
# press Shift+M to sort by memory usage

top - 09:26:25 up 1308 days, 15:32,  2 users,  load average: 3.14, 3.70, 4.37
Tasks: 582 total,   1 running, 581 sleeping,   0 stopped,   0 zombie
%Cpu(s):  5.7 us,  1.5 sy,  0.0 ni, 92.1 id,  0.0 wa,  0.0 hi,  0.8 si,  0.0 st
KiB Mem : 41147560 total,   319216 free, 34545608 used,  6282736 buff/cache
KiB Swap:        0 total,        0 free,        0 used.  9398588 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
18884 root      20   0 11.238g 0.010t  11720 S  48.8 26.7  39:52.43 ./dac
```

The dac process is using over 10 GB of memory (RES ≈ 0.010t).

Find which container it runs in:

```sh
root@iota-n3:/home/iota/etwatcher# systemd-cgls | grep 18884
│ │ ├─32574 grep --color=auto 18884
│ │ └─18884 ./dac
```

```sh
for i in $(docker container ls --format "{{.ID}}"); do docker inspect -f '{{.State.Pid}} {{.Name}}' $i; done | grep 18884
```

This locates the process to dac-2.

> To get a given container's PID:
>
> docker top container_id
>
> To get the PIDs of all containers:
>
> ```sh
> for l in `docker ps -q`;do docker top $l|awk -v dn="$l" 'NR>1 {print dn " PID is " $2}';done
> ```
>
> Or via docker inspect:
>
> ```sh
> docker inspect --format "{{.State.Pid}}" container_id/name
> ```

Check the dac-2 container info:

```sh
root@iota-n3:~# docker ps | grep dac-2
05b04c4667bc  repository.anxinyun.cn/iota/dac           "./dac"        2 hours ago  Up 2 hours  k8s_iota-dac_iota-dac-2_iota_d9879026-465b-11ec-ad00-c81f66cfe365_1
be5682a82cda  theiota.store/iota/filebeat               "filebeat -e"  4 hours ago  Up 4 hours  k8s_iota-filebeat_iota-dac-2_iota_d9879026-465b-11ec-ad00-c81f66cfe365_0
f23499bc5c22  gcr.io/google_containers/pause-amd64:3.0  "/pause"       4 hours ago  Up 4 hours  k8s_POD_iota-dac-2_iota_d9879026-465b-11ec-ad00-c81f66cfe365_0
c5bcbf648268  repository.anxinyun.cn/iota/dac           "./dac"        6 days ago   Up 6 days   k8s_iota-dac_iota-dac-2_iota_2364cf27-41a0-11ec-ad00-c81f66cfe365_0
```

> Two dac containers? (Ignore the other, stale one for now.)

Enter the container:

```sh
docker exec -it 05b04c4667bc /bin/ash
```

> No curl inside the container?
>
> Use `wget -q -O - https://www.baidu.com` to dump a response to stdout instead.
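The heap profile pulled below is served by Go's standard `net/http/pprof` package. For reference, a minimal sketch of how a service like dac would typically expose it on port 6060 (assumed from the URL used below; not the actual DAC source):

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on the default mux
)

func main() {
	go func() {
		// The pod IP (e.g. 10.244.1.235) then serves /debug/pprof/heap,
		// /debug/pprof/goroutine, /debug/pprof/profile, etc.
		log.Println(http.ListenAndServe(":6060", nil))
	}()
	select {} // stand-in for the service's real work
}
```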
On the host machine:

```sh
go tool pprof -inuse_space http://10.244.1.235:6060/debug/pprof/heap

# top: show the top 10 memory consumers
(pprof) top
Showing nodes accounting for 913.11MB, 85.77% of 1064.60MB total
Dropped 215 nodes (cum <= 5.32MB)
Showing top 10 nodes out of 109
      flat  flat%   sum%        cum   cum%
  534.20MB 50.18% 50.18%   534.20MB 50.18%  runtime.malg
   95.68MB  8.99% 59.17%    95.68MB  8.99%  iota/vendor/github.com/yuin/gopher-lua.newLTable
   61.91MB  5.82% 64.98%    90.47MB  8.50%  iota/vendor/github.com/yuin/gopher-lua.newFuncContext
   50.23MB  4.72% 69.70%    50.23MB  4.72%  iota/vendor/github.com/yuin/gopher-lua.newRegistry
   34.52MB  3.24% 72.94%    34.52MB  3.24%  iota/vendor/github.com/yuin/gopher-lua.(*LTable).RawSetString
      33MB  3.10% 76.04%       33MB  3.10%  iota/vendor/github.com/eclipse/paho%2emqtt%2egolang.outgoing
      31MB  2.91% 78.95%       31MB  2.91%  iota/vendor/github.com/eclipse/paho%2emqtt%2egolang.errorWatch
      31MB  2.91% 81.87%       31MB  2.91%  iota/vendor/github.com/eclipse/paho%2emqtt%2egolang.keepalive
   27.06MB  2.54% 84.41%    27.06MB  2.54%  iota/vendor/github.com/yuin/gopher-lua.newFunctionProto (inline)
   14.50MB  1.36% 85.77%    14.50MB  1.36%  iota/vendor/github.com/eclipse/paho%2emqtt%2egolang.alllogic
```

> `top` lists the heaviest consumers
>
> `list` shows function source annotated with sample data
>
> `disasm` shows assembly annotated with sample data
>
> `web` renders an SVG call graph

Running `go tool pprof` on the server produces a profile file; copy it to the local Windows machine and open it there:

![image-20211116103902511](imgs/UCloud-DAC上云测试/image-20211116103902511.png)

> Install graphviz:
>
> https://graphviz.gitlab.io/_pages/Download/Download_windows.html
>
> Download the zip, extract it, and add it to the system PATH.
>
> ```sh
> C:\Users\yww08>dot -version
> dot - graphviz version 2.45.20200701.0038 (20200701.0038)
> There is no layout engine support for "dot"
> Perhaps "dot -c" needs to be run (with installer's privileges) to register the plugins?
> ```
>
> Run `dot -c` (with installer privileges) once to register the plugins:
>
> ```sh
> dot -c
> ```

Run pprof locally:

```sh
go tool pprof --http=:8080 pprof.dac.alloc_objects.alloc_space.inuse_objects.inuse_space.003.pb.gz
```

!["sss"](imgs/UCloud-DAC上云测试/image-20211116112452820.png)

Memory usage is concentrated in `runtime.malg`.

After digging through a lot of material, this turns out to be a long-standing, known issue on the official Go tracker; the maintainers are aware of it, it is just hard to fix:

> Your observation is correct. Currently the runtime never frees the g objects created for goroutines, though it does reuse them. The main reason for this is that the scheduler often manipulates g pointers without write barriers (a lot of scheduler code runs without a P, and hence cannot have write barriers), and this makes it very hard to determine when a g can be garbage collected.

Roughly: Go's garbage collector is concurrent, and the scheduler manipulates g (goroutine) pointers without write barriers — much scheduler code runs without a P in the GPM model and therefore cannot use write barriers (see draveness's analysis) — so the collector cannot easily determine when a g is garbage, and the g objects referenced by the scheduler are never freed.

(From the CSDN post by wuyuhao13579, CC 4.0 BY-SA: https://blog.csdn.net/wuyuhao13579/article/details/109079570)

Looking at the process's logs, the problematic DAC instance repeatedly prints:

```sh
Loss connection
```

This line is logged by the DAC code when an MQTT connection drops. The source:

```go
func (d *Mqtt) Connect() (err error) {
	//TODO not safe
	d.setConnStat(statInit)
	//decode
	//set opts
	opts := pahomqtt.NewClientOptions().AddBroker(d.config.URL)
	opts.SetClientID(d.config.ClientID)
	opts.SetCleanSession(d.config.CleanSessionFlag)
	opts.SetKeepAlive(time.Second * time.Duration(d.config.KeepAlive)) // 30s
	opts.SetPingTimeout(time.Second * time.Duration(d.config.KeepAlive*2))
	// callback invoked by paho when the MQTT connection is lost
	opts.SetConnectionLostHandler(func(c pahomqtt.Client, err error) {
		log.Debug("[Mqtt] Loss connection, %s %v", err, d.config)
		d.terminateFlag <- true
		//d.Reconnect()
	})
	// ...
}
```

Each reconnect builds a new paho client with its own worker goroutines (`keepalive`, `errorWatch`, `outgoing` — exactly the names in the heap profile above), so a reconnect loop keeps growing the goroutine count, and the `runtime.malg` memory backing those goroutines is never returned.

## Object Storage (OSS)

Alibaba Cloud OSS basic concepts: https://help.aliyun.com/document_detail/31827.html
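For reference, a minimal upload sketch using the Aliyun OSS Go SDK (`github.com/aliyun/aliyun-oss-go-sdk/oss`); the endpoint, bucket name, and credentials are placeholders:

```go
package main

import (
	"log"
	"strings"

	"github.com/aliyun/aliyun-oss-go-sdk/oss"
)

func main() {
	// Placeholder endpoint and credentials — substitute real values.
	client, err := oss.New("oss-cn-hangzhou.aliyuncs.com", "<accessKeyID>", "<accessKeySecret>")
	if err != nil {
		log.Fatal(err)
	}
	bucket, err := client.Bucket("<bucket-name>")
	if err != nil {
		log.Fatal(err)
	}
	// Upload an object from any io.Reader.
	if err := bucket.PutObject("test/hello.txt", strings.NewReader("hello oss")); err != nil {
		log.Fatal(err)
	}
	log.Println("uploaded")
}
```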