*** Troy__ (329b9bb1@gateway/web/freenode/ip.50.155.155.177) has joined #wikid | 01:41 | |
*** mark___ (8f74745b@gateway/web/freenode/ip.143.116.116.91) has joined #wikid | 01:41 | |
mark___ | We have a major issue with Wikid is Nick around? | 01:44 |
---|---|---|
*** nowen (~nowen@172.56.4.48) has joined #wikid | 01:45 | |
nowen | hey | 01:45 |
mark___ | Can you join a call please | 01:46 |
mark___ | we had another failure | 01:47 |
mark___ | i sent the call in information to you via email | 01:47 |
nowen | sure give me a number | 01:47 |
nowen | ok | 01:47 |
mark___ | thank you | 01:48 |
mark___ | let me know when you have joined | 01:48 |
*** nowen has quit (Quit: qicr for android: faster and better) | 02:00 | |
*** mark___ has quit (Ping timeout: 250 seconds) | 03:10 | |
*** nowen (~nowen@99-174-92-191.lightspeed.tukrga.sbcglobal.net) has joined #wikid | 03:12 | |
nowen | Troy__: you still here? | 05:31 |
Troy__ | yes. i'm here | 05:33 |
nowen | I am so sorry | 05:34 |
Troy__ | it's ok.. now that we have so many users using 2FA, it's become a very critical app | 05:34 |
nowen | and it should be reliable | 05:35 |
nowen | as far as root cause goes, I'm not sure about the app telling the db to shut down. | 05:35 |
Troy__ | and I don't feel we were quite ready to go for that increase in that short amount of time | 05:35 |
nowen | I should have had you guys vacuuming the db and being more aggressive on that | 05:36 |
nowen | this reminds me of the last replication bug we had. log page slowed and then the server froze. I think it slows but doesn't die. | 05:37 |
nowen | I am very hopeful the db is much better | 05:37 |
nowen | in terms of root cause, I think it was db bloat and lack of vacuuming | 05:39 |
nowen | what do you need from me? | 05:40 |
Troy__ | yes.. i'd like to get a regular scheduled maintenance setup to avoid any db issues. | 05:41 |
Troy__ | do you have the location of the log archives? | 05:41 |
nowen | not off the top of my head, but 'locate *.log' should find them | 05:42 |
nowen | so, I would like to get the db size and the 2 timestamps daily | 05:42 |
nowen | we'll get you a script that will auto-archive, but I would like to hand hold it a bit to make sure we keep an eye on things too | 05:43 |
nowen | ok - time for sleep | 05:49 |
*** nowen has quit (Quit: Leaving.) | 05:56 | |
*** Troy__ has quit (Quit: Page closed) | 06:36 | |
*** nowen (~nowen@99-174-92-191.lightspeed.tukrga.sbcglobal.net) has joined #wikid | 14:29 | |
*** nowen has quit (Client Quit) | 14:31 | |
*** mark___ (8f74745b@gateway/web/freenode/ip.143.116.116.91) has joined #wikid | 14:53 | |
mark___ | Nick are you here? | 14:54 |
*** nowen (~nowen@99-174-92-191.lightspeed.tukrga.sbcglobal.net) has joined #wikid | 15:22 | |
nowen | morning mark___ | 15:22 |
*** mark___ has quit (Quit: Page closed) | 15:25 | |
*** Troy__ (329b9bb1@gateway/web/freenode/ip.50.155.155.177) has joined #wikid | 16:28 | |
Troy__ | Good morning Nick | 16:28 |
Troy__ | we are still running fine this morning | 16:28 |
nowen | Morning | 16:28 |
nowen | good to hear | 16:28 |
nowen | did you check the db size? | 16:28 |
Troy__ | I just wanted to verify the max timestamp command you gave us | 16:29 |
Troy__ | psql -h 127.0.0.1 -p 5434 -d wikid -U postgres -c "select max(timestamp) from logging_event" | 16:29 |
nowen | yes - when run on either server it will give you the local db's last logging event | 16:30 |
Troy__ | what are the units used?.. when I run this they seem to be off a bit between port 5434 and 5436 | 16:30 |
nowen | it's epoch time | 16:31 |
nowen | http://www.epochconverter.com/ | 16:31 |
Troy__ | -bash-3.2$ psql -h 127.0.0.1 -p 5434 -d wikid -U postgres -c "select max(timestamp) from logging_event" max --------------- 1385051458991 (1 row) | 16:31 |
Troy__ | -bash-3.2$ psql -h 127.0.0.1 -p 5436 -d wikid -U postgres -c "select max(timestamp) from logging_event" max --------------- 1385051497282 (1 row) | 16:31 |
nowen | it is in milli seconds | 16:31 |
Troy__ | ok. that could be the reason | 16:32 |
Troy__ | that's what I expected, but wanted to confirm | 16:32 |
nowen | there are some changes you can make to sshd_config to make ssh connections faster | 16:32 |
nowen | #UseDNS yes UseDNS no | 16:33 |
nowen | GSSAPIAuthentication no #GSSAPIAuthentication yes | 16:33 |
nowen | so, turn off dns and GSSAPI. we may have done this a while back, but worth checking | 16:33 |
Troy__ | ok.. I remember seeing that in the documentation | 16:34 |
Troy__ | I'll pass that on to the admin | 16:34 |
nowen | http://www.wikidsystems.com/support/wikid-support-center/installation-how-tos/how-to-configure-the-wikid-strong-authentication-system-for-replication at the bottom | 16:34 |
Troy__ | will changing the UseDNS to no have any effect? probably not since it's using IP right? | 16:34 |
nowen | It seems to | 16:34 |
Troy__ | it doesn't seem that we are off by more than a few seconds really | 16:34 |
Troy__ | i would be concerned if the delta would be greater than an hour | 16:36 |
Troy__ | that would probably indicate we lost connection to the slave | 16:36 |
nowen | are the clocks on both the same? | 16:37 |
Troy__ | yes..I believe they are getting updated by NTP on a regular basis | 16:38 |
Troy__ | not sure of the frequency.. but I'll check | 16:38 |
nowen | that would be good | 16:39 |
nowen | Troy__: can you check /etc/ssh/sshd_config too and check those settings? | 16:40 |
Troy__ | is there anything in particular you want to check in sshd_config? | 16:42 |
nowen | yes - if UseDNS is set to no | 16:43 |
Troy__ | looks like it's commented out.. #UseDNS yes | 16:44 |
Troy__ | so I'm not sure if default setting is no | 16:45 |
nowen | how about GSSAPIAuthentication? | 16:45 |
Troy__ | GSSAPIAuthentication yes | 16:45 |
Troy__ | so we would want to change both to no | 16:46 |
nowen | yes. if you run a traceroute between the two, I suspect you will see a latency of around 500ms | 16:47 |
nowen | so, that should be your db delay | 16:47 |
nowen | the config needs to change and sshd restarted. I might not sweat it under normal circumstances, but I felt the same way about vacuumdb :-) | 16:48 |
Troy__ | ok.. i'll ask the Data center guys to look into changing that configuration | 16:51 |
nowen | on both | 16:51 |
Troy__ | yes | 16:51 |
nowen | ok - how's the db size this am? | 16:51 |
Troy__ | i haven't checked yet this morning.. give me a second and I'll run it | 16:51 |
nowen | ok | 16:52 |
Troy__ | I know when sshd is restarted, we would temporarily lose the ssh connection.. would WiKID recover by retrying the connection and catch up or would we have to restart wikid again? | 16:56 |
Troy__ | wikid=# SELECT pg_size_pretty(pg_database_size('wikid')) As fulldbsize; fulldbsize ------------ 68 MB | 16:57 |
Troy__ | so we are not much larger | 16:57 |
nowen | I think it would recover. | 16:58 |
Troy__ | i think Vince said we were about 50MB last night | 16:58 |
nowen | yeah, I would think that there would be some pop at the start | 16:58 |
nowen | Troy__: can you go ahead and run traceroute between the two servers? | 17:00 |
Troy__ | were you able to find the command or a way to archive the logs via shell so we can automate this process? | 17:00 |
nowen | working on that | 17:00 |
Troy__ | I would like it to be done weekly | 17:00 |
nowen | Troy__: it's possible that transactions that occur during the ssh shutdown would be lost. | 17:01 |
Troy__ | ok.. like you said last night, I think weekly archive logs then run db vacuum full is best to keep this from growing too large | 17:01 |
nowen | that is - not pushed to the slave | 17:02 |
nowen | Troy__: yes | 17:02 |
Troy__ | ok.. most likely if it's done at the same time we would maybe miss a few log entries | 17:02 |
Troy__ | i'll try the traceroute now | 17:02 |
nowen | I defer to you guys, but I do want everything sparkly | 17:03 |
Troy__ | -bash-3.2$ sudo traceroute 148.164.251.59 traceroute to 148.164.251.59 (148.164.251.59), 30 hops max, 40 byte packets 1 143.116.32.251 (143.116.32.251) 0.488 ms 0.554 ms 0.613 ms 2 143.116.27.124 (143.116.27.124) 0.463 ms 0.504 ms 0.564 ms 3 143.116.161.185 (143.116.161.185) 1.198 ms 1.299 ms 1.379 ms 4 143.116.160.249 (143.116.160.249) 67.491 ms 69.291 ms 70.129 ms 5 148.164.174.249 (148.164.174.249) 66.492 ms | 17:06 |
Troy__ | from HSV to SJC | 17:06 |
Troy__ | the replication is done by root or wikid user? | 17:06 |
nowen | hmm - do you have a wikid user? I thought we did that after 1216 | 17:07 |
Troy__ | yes.. wikid is what we are using to start and stop | 17:08 |
nowen | ok. normally root - but check which user is running usogres | 17:10 |
Troy__ | yes.. root is running usogres | 17:20 |
Troy__ | I see some HTTP access logger entries like the following in the logs: | 17:54 |
Troy__ | 2013-11-21 11:34:47.405 WARN HTTP Access Logger 172.16.188.95 - - "GET /wikid/webstart/jw.properties.pack.gz HTTP/1.1" 404 344 | 17:54 |
nowen | hmm - are y'all using the web start token? | 17:54 |
Troy__ | is this just a client launching or requesting a OTP via the webstart client? | 17:54 |
Troy__ | yes.. we are rolling out the full wikid client, but many still have the web start client | 17:55 |
Troy__ | so there is a mixture | 17:55 |
nowen | there's an option to compress the webstart file. I guess it checks for it first. | 17:57 |
nowen | the gssapi setting is probably more bang for the buck if you don't want to fight the dns battle. | 17:57 |
nowen | also, you can create an entry in /etc/hosts on each server for each other | 17:58 |
Troy__ | ok. we are scheduled to talk about the change tomorrow.. but they don't think it's a problem to update. | 18:01 |
Troy__ | yes | 18:02 |
*** nowen has quit (Read error: Connection reset by peer) | 18:03 | |
*** nowen (~nowen@99-174-92-191.lightspeed.tukrga.sbcglobal.net) has joined #wikid | 18:04 | |
*** Troy__ has quit (Ping timeout: 250 seconds) | 20:18 | |
*** Troy__ (329b9bb1@gateway/web/freenode/ip.50.155.155.177) has joined #wikid | 20:24 | |
*** KORG has quit (Read error: Connection reset by peer) | 21:35 | |
*** nowen has quit (Quit: Leaving.) | 23:04 |
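Editor's sketch of the daily timestamp check discussed above (the two `max(timestamp)` values from ports 5434 and 5436, which are epoch milliseconds). The function names `lag_ms`, `check_lag`, and `epoch_ms_to_utc` are hypothetical helpers, not part of WiKID, and the one-hour default threshold is only the figure Troy said would concern him. `epoch_ms_to_utc` assumes GNU `date`. This is a rough illustration under those assumptions, not the script nowen promised.

```shell
#!/bin/sh
# Hypothetical helpers for comparing the last logging_event timestamp
# on the two databases. Nothing here talks to WiKID or PostgreSQL directly.

# Absolute difference between two epoch-millisecond timestamps, in ms.
lag_ms() {
    a=$1
    b=$2
    if [ "$a" -ge "$b" ]; then
        echo $((a - b))
    else
        echo $((b - a))
    fi
}

# Print "OK" or "ALERT" depending on whether the lag exceeds a threshold.
# $3 is the threshold in ms; it defaults to 3600000 (one hour), an assumption
# taken from the "greater than an hour" remark in the log.
check_lag() {
    lag=$(lag_ms "$1" "$2")
    threshold=${3:-3600000}
    if [ "$lag" -gt "$threshold" ]; then
        echo "ALERT lag=${lag}ms"
    else
        echo "OK lag=${lag}ms"
    fi
}

# Convert epoch milliseconds to a readable UTC time (GNU date syntax).
epoch_ms_to_utc() {
    date -u -d "@$(($1 / 1000))" '+%Y-%m-%d %H:%M:%S UTC'
}

# Example wiring (commented out; needs the databases from the log).
# -A and -t make psql print the bare value with no header or footer:
#   local_ts=$(psql -h 127.0.0.1 -p 5434 -d wikid -U postgres -At \
#       -c "select max(timestamp) from logging_event")
#   remote_ts=$(psql -h 127.0.0.1 -p 5436 -d wikid -U postgres -At \
#       -c "select max(timestamp) from logging_event")
#   check_lag "$local_ts" "$remote_ts"
```

With the two values Troy pasted, `check_lag 1385051458991 1385051497282` reports a lag of 38291 ms, roughly 38 seconds, which is well under the one-hour threshold.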