Thursday, 2013-11-21

*** Troy__ (329b9bb1@gateway/web/freenode/ip.50.155.155.177) has joined #wikid		01:41
*** mark___ (8f74745b@gateway/web/freenode/ip.143.116.116.91) has joined #wikid		01:41
mark___	We have a major issue with Wikid. Is Nick around?	01:44
*** nowen (~nowen@172.56.4.48) has joined #wikid		01:45
nowen	hey	01:45
mark___	Can you join a call please	01:46
mark___	we had another failure	01:47
mark___	i sent the call in information to you via email	01:47
nowen	sure give me a number	01:47
nowen	ok	01:47
mark___	thank you	01:48
mark___	let me know when you have joined	01:48
*** nowen has quit (Quit: qicr for android: faster and better)		02:00
*** mark___ has quit (Ping timeout: 250 seconds)		03:10
*** nowen (~nowen@99-174-92-191.lightspeed.tukrga.sbcglobal.net) has joined #wikid		03:12
nowen	Troy__: you still here?	05:31
Troy__	yes. i'm here	05:33
nowen	I am so sorry	05:34
Troy__	it's ok.. now that we have so many users using 2FA, it's become a very critical app	05:34
nowen	and it should be reliable	05:35
nowen	as far as root cause goes, I'm not sure about the app telling the db to shut down.	05:35
Troy__	and I don't feel we were quite ready to go for that increase in that short amount of time	05:35
nowen	I should have had you guys vacuuming the db and being more aggressive on that	05:36
nowen	this reminds me of the last replication bug we had. log page slowed and then the server froze. I think it slows but doesn't die.	05:37
nowen	I am very hopeful the db is much better	05:37
nowen	in terms of root cause, I think it was db bloat and lack of vacuuming	05:39
nowen	what do you need from me?	05:40
Troy__	yes.. i'd like to get a regular scheduled maintenance setup to avoid any db issues.	05:41
Troy__	do you have the location of the log archives?	05:41
nowen	not off the top of my head, but 'locate *.log' should find them	05:42
nowen	so, I would like to get the db size and the 2 timestamps daily	05:42
nowen	we'll get you a script that will auto-archive, but I would like to hand hold it a bit to make sure we keep an eye on things too	05:43
nowen	ok - time for sleep	05:49
*** nowen has quit (Quit: Leaving.)		05:56
*** Troy__ has quit (Quit: Page closed)		06:36
*** nowen (~nowen@99-174-92-191.lightspeed.tukrga.sbcglobal.net) has joined #wikid		14:29
*** nowen has quit (Client Quit)		14:31
*** mark___ (8f74745b@gateway/web/freenode/ip.143.116.116.91) has joined #wikid		14:53
mark___	Nick are you here?	14:54
*** nowen (~nowen@99-174-92-191.lightspeed.tukrga.sbcglobal.net) has joined #wikid		15:22
nowen	morning mark___	15:22
*** mark___ has quit (Quit: Page closed)		15:25
*** Troy__ (329b9bb1@gateway/web/freenode/ip.50.155.155.177) has joined #wikid		16:28
Troy__	Good morning Nick	16:28
Troy__	we are still running fine this morning	16:28
nowen	Morning	16:28
nowen	good to hear	16:28
nowen	did you check the db size?	16:28
Troy__	I just wanted to verify the max timestamp command you gave us	16:29
Troy__	psql -h 127.0.0.1 -p 5434 -d wikid -U postgres -c "select max(timestamp) from logging_event"	16:29
nowen	yes - when run on either server it will give you the local db's last logging event	16:30
Troy__	what are the units used?.. when I run this they seem to be off a bit between port 5434 and 5436	16:30
nowen	it's epoch time	16:31
nowen	http://www.epochconverter.com/	16:31
Troy__	-bash-3.2$ psql -h 127.0.0.1 -p 5434 -d wikid -U postgres -c "select max(timestamp) from logging_event" max --------------- 1385051458991 (1 row)	16:31
Troy__	-bash-3.2$ psql -h 127.0.0.1 -p 5436 -d wikid -U postgres -c "select max(timestamp) from logging_event" max --------------- 1385051497282 (1 row)	16:31
nowen	it is in milli seconds	16:31
Troy__	ok. that could be the reason	16:32
Troy__	that's what I expected, but wanted to confirm	16:32
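To make the two values above easier to compare, here is a minimal Python sketch that turns epoch milliseconds into readable UTC times. The helper name `ms_to_utc` is made up for illustration; the two inputs are the values pasted in the channel for ports 5434 and 5436.

```python
from datetime import datetime, timedelta, timezone

# Unix epoch as a timezone-aware UTC datetime.
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def ms_to_utc(ms: int) -> datetime:
    """Convert a Unix epoch value in milliseconds to an aware UTC datetime."""
    # Integer timedelta arithmetic avoids float rounding of the millisecond part.
    return EPOCH + timedelta(milliseconds=ms)


# max(timestamp) values reported for the databases on ports 5434 and 5436.
for port, ms in ((5434, 1385051458991), (5436, 1385051497282)):
    print(port, ms_to_utc(ms).isoformat(timespec="milliseconds"))
# prints:
# 5434 2013-11-21T16:30:58.991+00:00
# 5436 2013-11-21T16:31:37.282+00:00
```

Dividing by 1000 and pasting the result into an epoch converter gives the same answer; the sketch just keeps the millisecond part intact.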
nowen	there are some changes you can make to sshd_config to make ssh connections faster	16:32
nowen	#UseDNS yes UseDNS no	16:33
nowen	GSSAPIAuthentication no #GSSAPIAuthentication yes	16:33
nowen	so, turn off dns and GSSAPI. we may have done this a while back, but worth checking	16:33
Troy__	ok.. I remember seeing that in the documentation	16:34
Troy__	I'll pass that on to the admin	16:34
nowen	http://www.wikidsystems.com/support/wikid-support-center/installation-how-tos/how-to-configure-the-wikid-strong-authentication-system-for-replication at the bottom	16:34
Troy__	will changing the UseDNS to no have any effect? probably not since it's using IP right?	16:34
nowen	It seems to	16:34
Troy__	it doesn't seem that we are off by more than a few seconds really	16:34
Troy__	i would be concerned if the delta would be greater than an hour	16:36
Troy__	that would probably indicate we lost connection to the slave	16:36
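The rule of thumb Troy describes (a few seconds is fine, more than an hour is a warning sign) could be checked along these lines. This is only a sketch: the function names are hypothetical, the one-hour threshold is the figure mentioned in the channel rather than a WiKID recommendation, and the comparison assumes both queries are run at about the same moment.

```python
ONE_HOUR_MS = 60 * 60 * 1000  # threshold from the discussion, in milliseconds


def log_delta_ms(first_max_ms: int, second_max_ms: int) -> int:
    """Absolute gap between the newest logging_event timestamps of two databases."""
    return abs(first_max_ms - second_max_ms)


def looks_disconnected(first_max_ms: int, second_max_ms: int,
                       threshold_ms: int = ONE_HOUR_MS) -> bool:
    """True when the gap is large enough to suspect replication has stopped."""
    return log_delta_ms(first_max_ms, second_max_ms) > threshold_ms


# Values reported for ports 5434 and 5436.
gap = log_delta_ms(1385051458991, 1385051497282)
print(gap / 1000, "seconds")  # 38.291 seconds
print(looks_disconnected(1385051458991, 1385051497282))  # False
```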
nowen	are the clocks on both the same?	16:37
Troy__	yes..I believe they are getting updated by NTP on a regular basis	16:38
Troy__	not sure of the frequency.. but I'll check	16:38
nowen	that would be good	16:39
nowen	Troy__: can you check /etc/ssh/sshd_config too and check those settings?	16:40
Troy__	is there anything in particular you want to check in sshd_config?	16:42
nowen	yes - if UseDNS is set to no	16:43
Troy__	looks like it's commented out.. #UseDNS yes	16:44
Troy__	so I'm not sure if default setting is no	16:45
nowen	how about GSSAPIAuthentication?	16:45
Troy__	GSSAPIAuthentication yes	16:45
Troy__	so we would want to change both to no	16:46
nowen	yes. if you run a traceroute between the two, I suspect you will see a latency of around 500ms	16:47
nowen	so, that should be your db delay	16:47
nowen	the config needs to change and sshd restarted. I might not sweat it under normal circumstances, but I felt the same way about vacuumdb :-)	16:48
Troy__	ok.. i'll ask the Data center guys to look into changing that configuration	16:51
nowen	on both	16:51
Troy__	yes	16:51
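For checking both servers the same way, a small Python sketch like the following could report what sshd_config actually sets for the two keywords discussed above. It is a simplification and an assumption-laden one: `effective_settings` is a hypothetical helper, it stops at the first `Match` block, and it does not follow `Include` files. A keyword that is only present as a comment (such as `#UseDNS yes`) is reported as `None`, meaning sshd falls back to its built-in default, which depends on the installed OpenSSH version (see `man sshd_config` on the server).

```python
def effective_settings(config_text, keywords=("UseDNS", "GSSAPIAuthentication")):
    """Return {keyword: value or None} for explicitly set sshd_config keywords."""
    wanted = {k.lower(): k for k in keywords}
    found = {k: None for k in keywords}
    for raw in config_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments set nothing
        # sshd_config accepts "Keyword value" as well as "Keyword=value".
        parts = line.replace("=", " ", 1).split(None, 1)
        if parts[0].lower() == "match":
            break  # conditional blocks are out of scope for this sketch
        if len(parts) < 2:
            continue
        key = wanted.get(parts[0].lower())  # keywords are case-insensitive
        if key is not None and found[key] is None:
            found[key] = parts[1].strip()  # the first occurrence wins in sshd
    return found


# The state reported in the channel: UseDNS commented out, GSSAPI enabled.
current = "#UseDNS yes\nGSSAPIAuthentication yes\n"
print(effective_settings(current))
# {'UseDNS': None, 'GSSAPIAuthentication': 'yes'}
```

To check a live server, the same function could be fed the contents of /etc/ssh/sshd_config, read with whatever privileges that file requires.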
nowen	ok - how's the db size this am?	16:51
Troy__	i haven't checked yet this morning.. give me a second and I'll run it	16:51
nowen	ok	16:52
Troy__	I know when sshd is restarted, we would temporarily lose the ssh connection.. would WiKID recover by retrying the connection and catch up or would we have to restart wikid again?	16:56
Troy__	wikid=# SELECT pg_size_pretty(pg_database_size('wikid')) As fulldbsize; fulldbsize ------------ 68 MB	16:57
Troy__	so we are not much larger	16:57
nowen	I think it would recover.	16:58
Troy__	i think Vince said we were about 50MB last night	16:58
nowen	yeah, I would think that there would be some pop at the start	16:58
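The daily numbers asked for earlier (database size plus the two max timestamps) could be collected with a Python sketch like the one below. It reuses the psql invocation and the two queries quoted in the channel and adds the standard psql flags `-A` (unaligned) and `-t` (tuples only) so the output is a bare value. Everything else is an assumption: the function names are hypothetical, both ports are assumed reachable from one host, and psql is assumed to connect without prompting for a password.

```python
import subprocess

HOST = "127.0.0.1"
PORTS = (5434, 5436)  # the two database ports mentioned in the channel
QUERIES = {
    "max_log_timestamp_ms": "select max(timestamp) from logging_event",
    "db_size": "select pg_size_pretty(pg_database_size('wikid'))",
}


def psql_command(port, sql):
    """Build the psql argument list for one query against one port."""
    return ["psql", "-h", HOST, "-p", str(port), "-d", "wikid",
            "-U", "postgres", "-A", "-t", "-c", sql]


def run_daily_check():
    """Run every query on every port and return {(port, name): value}."""
    results = {}
    for port in PORTS:
        for name, sql in QUERIES.items():
            done = subprocess.run(psql_command(port, sql),
                                  capture_output=True, text=True, check=True)
            results[(port, name)] = done.stdout.strip()
    return results
```

Calling `run_daily_check()` from a daily cron job and appending the result to a file would give a simple growth history to review alongside the vacuum schedule.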
nowen	Troy__: can you go ahead and run traceroute between the two servers?	17:00
Troy__	were you able to find the command or a way to archive the logs via shell so we can automate this process?	17:00
nowen	working on that	17:00
Troy__	I would like it to be done weekly	17:00
nowen	Troy__: it's possible that transactions that occur during the ssh shutdown would be lost.	17:01
Troy__	ok.. like you said last night, I think weekly archive logs then run db vacuum full is best to keep this from growing too large	17:01
nowen	that is - not pushed to the slave	17:02
nowen	Troy__: yes	17:02
Troy__	ok.. most likely if it's done at the same time we would maybe miss a few logs entries	17:02
Troy__	i'll try the traceroute now	17:02
nowen	I defer to you guys, but I do want everything sparkly	17:03
Troy__	-bash-3.2$ sudo traceroute 148.164.251.59 traceroute to 148.164.251.59 (148.164.251.59), 30 hops max, 40 byte packets 1 143.116.32.251 (143.116.32.251) 0.488 ms 0.554 ms 0.613 ms 2 143.116.27.124 (143.116.27.124) 0.463 ms 0.504 ms 0.564 ms 3 143.116.161.185 (143.116.161.185) 1.198 ms 1.299 ms 1.379 ms 4 143.116.160.249 (143.116.160.249) 67.491 ms 69.291 ms 70.129 ms 5 148.164.174.249 (148.164.174.249) 66.492 ms	17:06
Troy__	from HSV to SJC	17:06
Troy__	the replication is done by root or wikid user?	17:06
nowen	hmm - do you have a wikid user? I thought we did that after 1216	17:07
Troy__	yes.. wikid is what we are using to start and stop	17:08
nowen	ok. normally root - but check which user is running usogres	17:10
Troy__	yes.. root is running usogres	17:20
Troy__	I see some HTTP access logger entries like the following in the logs:	17:54
Troy__	2013-11-21 11:34:47.405 WARN HTTP Access Logger 172.16.188.95 - - "GET /wikid/webstart/jw.properties.pack.gz HTTP/1.1" 404 344	17:54
nowen	hmm - are y'all using the web start token?	17:54
Troy__	is this just a client launching or requesting a OTP via the webstart client?	17:54
Troy__	yes.. we are rolling out the full wikid client, but many still have the web start client	17:55
Troy__	so there is a mixture	17:55
nowen	there's an option to compress the webstart file. I guess it checks for it first.	17:57
nowen	the gssapi setting is probably more bang for the buck if you don't want to fight the dns battle.	17:57
nowen	also, you can create an entry in /etc/hosts on each server for each other	17:58
Troy__	ok. we are scheduled to talk about the change tomorrow.. but they don't think it's a problem to update.	18:01
Troy__	yes	18:02
*** nowen has quit (Read error: Connection reset by peer)		18:03
*** nowen (~nowen@99-174-92-191.lightspeed.tukrga.sbcglobal.net) has joined #wikid		18:04
*** Troy__ has quit (Ping timeout: 250 seconds)		20:18
*** Troy__ (329b9bb1@gateway/web/freenode/ip.50.155.155.177) has joined #wikid		20:24
*** KORG has quit (Read error: Connection reset by peer)		21:35
*** nowen has quit (Quit: Leaving.)		23:04
