Thursday, 2013-11-21

*** Troy__ (329b9bb1@gateway/web/freenode/ip.50.155.155.177) has joined #wikid01:41
*** mark___ (8f74745b@gateway/web/freenode/ip.143.116.116.91) has joined #wikid01:41
mark___We have a major issue with WiKID. Is Nick around?01:44
*** nowen (~nowen@172.56.4.48) has joined #wikid01:45
nowenhey01:45
mark___Can you join a call please01:46
mark___we had another failure01:47
mark___i sent the call in information to you via email01:47
nowensure give me a number01:47
nowenok01:47
mark___thank you01:48
mark___let me know when you have joined01:48
*** nowen has quit (Quit: qicr for android: faster and better)02:00
*** mark___ has quit (Ping timeout: 250 seconds)03:10
*** nowen (~nowen@99-174-92-191.lightspeed.tukrga.sbcglobal.net) has joined #wikid03:12
nowenTroy__: you still here?05:31
Troy__yes. i'm here05:33
nowenI am so sorry05:34
Troy__it's ok.. now that we have so many users using 2FA, it's become a very critical app05:34
nowenand it should be reliable05:35
nowenas far as root cause goes, I'm not sure about the app telling the db to shut down.05:35
Troy__and I don't feel we were quite ready to go for that increase in that short amount of time05:35
nowenI should have had you guys vacuuming the db and being more aggressive on that05:36
nowenthis reminds me of the last replication bug we had.  log page slowed and then the server froze. I think it slows but doesn't die.05:37
nowenI am very hopeful the db is much better05:37
nowenin terms of root cause, I think it was db bloat and lack of vacuuming05:39
nowenwhat do you need from me?05:40
Troy__yes.. i'd like to get a regular scheduled maintenance setup to avoid any db issues.05:41
Troy__do you have the location of the log archives?05:41
nowennot off the top of my head, but 'locate *.log' should find them05:42
nowenso, I would like to get the db size and the 2 timestamps daily05:42
nowenwe'll get you a script that will auto-archive, but I would like to hand hold it a bit to make sure we keep an eye on things too05:43
nowenok - time for sleep05:49
*** nowen has quit (Quit: Leaving.)05:56
*** Troy__ has quit (Quit: Page closed)06:36
*** nowen (~nowen@99-174-92-191.lightspeed.tukrga.sbcglobal.net) has joined #wikid14:29
*** nowen has quit (Client Quit)14:31
*** mark___ (8f74745b@gateway/web/freenode/ip.143.116.116.91) has joined #wikid14:53
mark___Nick are you here?14:54
*** nowen (~nowen@99-174-92-191.lightspeed.tukrga.sbcglobal.net) has joined #wikid15:22
nowenmorning mark___15:22
*** mark___ has quit (Quit: Page closed)15:25
*** Troy__ (329b9bb1@gateway/web/freenode/ip.50.155.155.177) has joined #wikid16:28
Troy__Good morning Nick16:28
Troy__we are still running fine this morning16:28
nowenMorning16:28
nowengood to hear16:28
nowendid you check the db size?16:28
Troy__I just wanted to verify the max timestamp command you gave us16:29
Troy__psql -h 127.0.0.1 -p 5434 -d wikid -U postgres -c "select max(timestamp) from logging_event"16:29
nowenyes - when run on either server it will give you the local db's last logging event16:30
Troy__what are the units used?.. when I run this they seem to be off a bit between port 5434 and 543616:30
nowenit's epoch time16:31
nowenhttp://www.epochconverter.com/16:31
Troy__-bash-3.2$ psql -h 127.0.0.1 -p 5434 -d wikid -U postgres -c "select max(timestamp) from logging_event"       max ---------------  1385051458991 (1 row)16:31
Troy__-bash-3.2$ psql -h 127.0.0.1 -p 5436 -d wikid -U postgres -c "select max(timestamp) from logging_event"       max ---------------  1385051497282 (1 row)16:31
nowenit is in milliseconds16:31
Troy__ok. that could be the reason16:32
Troy__that's what I expected, but wanted to confirm16:32
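A minimal sketch of the conversion discussed above, assuming Python 3 is available. The two values are the ones pasted from ports 5434 and 5436; the helper name `ms_to_utc` is hypothetical. It turns epoch milliseconds into UTC times and prints the gap between the two results. Part of that gap may simply be the time that passed between running the two commands.

```python
from datetime import datetime, timezone

def ms_to_utc(epoch_ms):
    """Convert epoch milliseconds to a timezone-aware UTC datetime."""
    seconds, millis = divmod(epoch_ms, 1000)
    moment = datetime.fromtimestamp(seconds, tz=timezone.utc)
    return moment.replace(microsecond=millis * 1000)

# Values copied from the two psql runs in the chat.
port_5434_ms = 1385051458991
port_5436_ms = 1385051497282

gap_ms = port_5436_ms - port_5434_ms

print(ms_to_utc(port_5434_ms).isoformat(timespec="milliseconds"))
print(ms_to_utc(port_5436_ms).isoformat(timespec="milliseconds"))
print(f"gap: {gap_ms / 1000:.3f} s")
```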
nowenthere are some changes you can make to sshd_config to make ssh connections faster16:32
nowen#UseDNS yes UseDNS no16:33
nowenGSSAPIAuthentication no #GSSAPIAuthentication yes16:33
nowenso, turn off dns and GSSAPI.  we may have done this a while back, but worth checking16:33
Troy__ok.. I remember seeing that in the documentation16:34
Troy__I'll pass that on to the admin16:34
nowenhttp://www.wikidsystems.com/support/wikid-support-center/installation-how-tos/how-to-configure-the-wikid-strong-authentication-system-for-replication at the bottom16:34
Troy__will changing the UseDNS to no have any effect?  probably not since it's using IP right?16:34
nowenIt seems to16:34
Troy__it doesn't seem that we are off by more than a few seconds really16:34
Troy__i would be concerned if the delta would be greater than an hour16:36
Troy__that would probably indicate we lost connection to the slave16:36
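The one-hour check described here could be sketched roughly as follows, assuming the output of the two psql commands has been captured as text. The function names and the `ONE_HOUR_MS` constant are hypothetical, and the parsing assumes psql's default table layout as pasted in the chat.

```python
import re

ONE_HOUR_MS = 60 * 60 * 1000  # threshold mentioned in the chat; adjust as needed

def parse_max_timestamp(psql_output):
    """Pull the epoch-millisecond value out of psql's default table output."""
    match = re.search(r"^\s*(\d+)\s*$", psql_output, re.MULTILINE)
    if match is None:
        raise ValueError("no numeric row found in psql output")
    return int(match.group(1))

def replication_lag(first_output, second_output, threshold_ms=ONE_HOUR_MS):
    """Return (lag_ms, exceeds_threshold) for two captured psql outputs."""
    lag_ms = abs(parse_max_timestamp(first_output) - parse_max_timestamp(second_output))
    return lag_ms, lag_ms > threshold_ms

# Outputs as pasted from ports 5434 and 5436.
out_5434 = "      max\n---------------\n 1385051458991\n(1 row)\n"
out_5436 = "      max\n---------------\n 1385051497282\n(1 row)\n"

lag_ms, too_far_apart = replication_lag(out_5434, out_5436)
print(f"lag: {lag_ms} ms, exceeds one hour: {too_far_apart}")
```

A check like this only means something if the clocks on both servers agree, and if the two queries are run close together.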
nowenare the clocks on both the same?16:37
Troy__yes..I believe they are getting updated by NTP on a regular basis16:38
Troy__not sure of the frequency.. but I'll check16:38
nowenthat would be good16:39
nowenTroy__: can you check /etc/ssh/sshd_config  too and check those settings?16:40
Troy__is there anything in particular you want to check in sshd_config?16:42
nowenyes - if UseDNS is set to no16:43
Troy__looks like it's commented out.. #UseDNS yes16:44
Troy__so I'm not sure if default setting is no16:45
nowenhow about GSSAPIAuthentication?16:45
Troy__GSSAPIAuthentication yes16:45
Troy__so we would want to change both to no16:46
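A rough way to check the effective settings, sketched under a few assumptions: sshd treats keywords case-insensitively and uses the first uncommented value it finds for each one, and this simplified parser ignores Match blocks, Include directives, and the `keyword=value` form. The function name and the sample text are hypothetical. The sample mirrors what was found in the file. A commented-out line in the shipped config usually documents the default, so `#UseDNS yes` most likely means DNS lookups are on for this build.

```python
def effective_sshd_settings(config_text):
    """Map lowercased keyword -> value, keeping the first uncommented occurrence."""
    settings = {}
    for raw_line in config_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments do not set anything
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue  # keyword without a value; skip in this sketch
        keyword, value = parts[0].lower(), parts[1].strip()
        settings.setdefault(keyword, value)  # first value wins
    return settings

# Sample mirroring the two lines reported in the chat.
sample = """\
#UseDNS yes
GSSAPIAuthentication yes
"""

found = effective_sshd_settings(sample)
for keyword in ("usedns", "gssapiauthentication"):
    print(keyword, "->", found.get(keyword, "not set (built-in default applies)"))
```

To run it against the real file, read `/etc/ssh/sshd_config` into a string and pass it in. Both keywords should report `no` after the change and an sshd restart.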
nowenyes.  if you run a traceroute between the two, I suspect you will see a latency of around 500ms16:47
nowenso, that should be your db delay16:47
nowenthe config needs to change and sshd restarted.  I might not sweat it under normal circumstances, but I felt the same way about vacuumdb :-)16:48
Troy__ok.. i'll ask the Data center guys to look into changing that configuration16:51
nowenon both16:51
Troy__yes16:51
nowenok - how's the db size this am?16:51
Troy__i haven't checked yet this morning.. give me a second and I'll run it16:51
nowenok16:52
Troy__I know when sshd is restarted, we would temporarily lose the ssh connection.. would WiKID recover by retrying the connection and catch up or would we have to restart wikid again?16:56
Troy__wikid=# SELECT pg_size_pretty(pg_database_size('wikid')) As fulldbsize;  fulldbsize ------------  68 MB16:57
Troy__so we are not much larger16:57
nowenI think it would recover.16:58
Troy__i think Vince said we were about 50MB last night16:58
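To put the two size readings side by side, a small sketch, assuming the values come from `pg_size_pretty` (which reports whole numbers in 1024-based units) and taking "about 50MB" as exactly 50 MB. The names `UNIT_FACTORS` and `pretty_size_to_bytes` are hypothetical.

```python
# 1024-based units as printed by pg_size_pretty.
UNIT_FACTORS = {"bytes": 1, "kB": 1024, "MB": 1024**2, "GB": 1024**3, "TB": 1024**4}

def pretty_size_to_bytes(text):
    """Convert a string such as '68 MB' into a byte count."""
    number, unit = text.split()
    return int(number) * UNIT_FACTORS[unit]

last_night = pretty_size_to_bytes("50 MB")    # approximate figure from the chat
this_morning = pretty_size_to_bytes("68 MB")  # value reported by pg_size_pretty

growth_pct = (this_morning - last_night) / last_night * 100
print(f"growth since last night: {growth_pct:.0f}%")
```

For a daily record it is probably simpler to log `pg_database_size('wikid')` directly, since it returns the size in bytes and needs no parsing.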
nowenyeah, I would think that there would be some pop at the start16:58
nowenTroy__: can you go ahead and run traceroute between the two servers?17:00
Troy__were you able to find the command or a way to archive the logs via shell so we can automate this process?17:00
nowenworking on that17:00
Troy__I would like it to be done weekly17:00
nowenTroy__: it's possible that transactions that occur during the ssh shutdown would be lost.17:01
Troy__ok.. like you said last night, I think archiving the logs weekly and then running db vacuum full is best to keep this from growing too large17:01
nowenthat is - not pushed to the slave17:02
nowenTroy__: yes17:02
Troy__ok.. most likely if it's done at the same time we would maybe miss a few log entries17:02
Troy__i'll try the traceroute now17:02
nowenI defer to you guys, but I do want everything sparkly17:03
Troy__-bash-3.2$ sudo traceroute 148.164.251.59 traceroute to 148.164.251.59 (148.164.251.59), 30 hops max, 40 byte packets  1  143.116.32.251 (143.116.32.251)  0.488 ms  0.554 ms  0.613 ms  2  143.116.27.124 (143.116.27.124)  0.463 ms  0.504 ms  0.564 ms  3  143.116.161.185 (143.116.161.185)  1.198 ms  1.299 ms  1.379 ms  4  143.116.160.249 (143.116.160.249)  67.491 ms  69.291 ms  70.129 ms  5  148.164.174.249 (148.164.174.249)  66.492 ms 17:06
Troy__from HSV to SJC17:06
Troy__the replication is done by root or wikid user?17:06
nowenhmm - do you have a wikid user?  I thought we did that after 121617:07
Troy__yes.. wikid is what we are using to start and stop17:08
nowenok.  normally root - but check which user is running usogres17:10
Troy__yes.. root is running usogres17:20
Troy__I see some HTTP access logger entries like the following in the logs:17:54
Troy__2013-11-21 11:34:47.405 WARN HTTP Access Logger 172.16.188.95 - - "GET /wikid/webstart/jw.properties.pack.gz HTTP/1.1" 404 34417:54
nowenhmm - are y'all using the web start token?17:54
Troy__is this just a client launching or requesting a OTP via the webstart client?17:54
Troy__yes.. we are rolling out the full wikid client, but many still have the web start client17:55
Troy__so there is a mixture17:55
nowenthere's an option to compress the webstart file. I guess it checks for it first.17:57
nowenthe gssapi setting is probably more bang for the buck if you don't want to fight the dns battle.17:57
nowenalso, you can create an entry in /etc/hosts on each server for each other17:58
Troy__ok. we are scheduled to talk about the change tomorrow.. but they don't think it's a problem to update.18:01
Troy__yes18:02
*** nowen has quit (Read error: Connection reset by peer)18:03
*** nowen (~nowen@99-174-92-191.lightspeed.tukrga.sbcglobal.net) has joined #wikid18:04
*** Troy__ has quit (Ping timeout: 250 seconds)20:18
*** Troy__ (329b9bb1@gateway/web/freenode/ip.50.155.155.177) has joined #wikid20:24
*** KORG has quit (Read error: Connection reset by peer)21:35
*** nowen has quit (Quit: Leaving.)23:04
