*** nowen has quit (Quit: Bye) | 00:29 | |
*** rfxn (~teck7@bas1-montreal54-1167956021.dsl.bell.ca) has joined #wikid | 07:11 | |
*** nowen (~nowen@50-194-249-125-static.hfc.comcastbusiness.net) has joined #wikid | 13:43 | |
*** Troy_ (329b98a8@gateway/web/freenode/ip.50.155.152.168) has joined #wikid | 16:14 | |
nowen | Hi Troy_ | 16:14 |
---|---|---|
Troy_ | Good morning | 16:14 |
*** Troy_ is now known as Guest64590 | 16:14 | |
Guest64590 | i apologize for the delayed responses yesterday.. it was my son's b-day and got busy | 16:15 |
nowen | no problem. just wanted to make sure that I hadn't dropped a ball | 16:15 |
nowen | any ideas on the cause? | 16:16 |
Guest64590 | no.. not yet.. just seemed to get stuck early Sunday morning.. issued OTPs but wasn't able to do much of anything else | 16:18 |
Guest64590 | once I shutdown the primary server, the failover scripts took over and we were back up on the secondary | 16:18 |
Guest64590 | now I can't get the primary back up | 16:18 |
Guest64590 | did you see anything from the pgsql logs? | 16:19 |
nowen | not really. | 16:22 |
nowen | can you start postgres via the service command? | 16:22 |
Guest64590 | i have not attempted that yet.. i was waiting on Vince (db admin) before I did anything else.. but I haven't been able to reach him this morning yet | 16:23 |
nowen | and there was nothing in the tomcat logs? | 16:26 |
Guest64590 | I just sent the last few entries to you | 16:30 |
Guest64590 | I don't see anything from the recent startup attemp | 16:30 |
Guest64590 | attempt | 16:30 |
nowen | yeah, but that error could point to the failure. the time seems about right, right? | 16:32 |
Guest64590 | yes.. actually I believe these latest entries would be about an hour or so after we saw failures | 16:35 |
Guest64590 | I'm going back a bit to see what is going on earlier | 16:35 |
Guest64590 | actually from the Wikid logs, the last verified creds happened at about 6:09AM.. but users got OTPs up until I stopped the service at about 7:40AM | 16:37 |
nowen | did y'all ever upgrade your WiKID version? | 16:49 |
nowen | Troy: Can you reboot the server? I think we should archive/mv the postgres log and see if it is complaining on a fresh restart | 16:59 |
Guest64590 | no.. we are still at wikid-server-enterprise-3.4.87-b1216 | 17:00 |
Guest64590 | yes.. i will see about rebooting this server.. hopefully that will clear up anything that is hung up | 17:01 |
Guest64590 | the server is rebooting now | 17:19 |
nowen | ok | 17:19 |
nowen | is wikid set to autostart? | 17:19 |
Guest64590 | no | 17:22 |
nowen | ok | 17:22 |
Guest64590 | one question.. on the setup.conf.. is the port different for standalone? | 17:22 |
nowen | just mv the pgstartup.log anywhere else | 17:22 |
nowen | yes, but if you set the role to none and then run setup and set it to none again, it will change everything. | 17:23 |
Guest64590 | postgres_port=5434 or 5432 ? | 17:23 |
nowen | 5432 | 17:23 |
Guest64590 | ok | 17:23 |
nowen | for stand alone | 17:23 |
Guest64590 | ok.. i moved pgstartup.log to /tmp and ran wikidctl setup, set to none | 17:26 |
Guest64590 | attempting start now | 17:26 |
nowen | ok | 17:27 |
Guest64590 | ok.. that worked! | 17:27 |
nowen | ok | 17:28 |
Guest64590 | now I just need to fail back over | 17:28 |
nowen | so, one thing I wanted to talk about was how we designed the failover. | 17:28 |
Guest64590 | ok | 17:28 |
nowen | we developed it for a customer that was neutral on which server was primary | 17:29 |
nowen | so, the fastest recovery was to make the old primary the secondary, sync the updates and restart | 17:29 |
nowen | if y'all want to keep the primary the primary, we can write a script that will facilitate the fail back | 17:31 |
nowen | but it will still take longer than switching | 17:31 |
nowen | does that make sense? | 17:31 |
Guest64590 | ok. makes sense.. we would need to adjust our failover scripts for this to happen the way you intended | 17:31 |
nowen | also, we're going to look into upgrading postgres to a more recent version which I think will have some good performance improvements for you guys | 17:32 |
Guest64590 | that would be awesome. i need to schedule a time for upgrading to the latest Wikid version | 17:33 |
Guest64590 | sounds like we would need to re-create all the local and network client certificates.. so it may take some time | 17:34 |
nowen | yeah - will you talk to Vince about it and see how he feels? definitely want him on board and if he has any preferences I'd like to know | 17:34 |
nowen | I don't think so | 17:34 |
nowen | only if you change the IP addresses | 17:34 |
Guest64590 | you mentioned something a few months back about updating the certificates for one of the later builds | 17:35 |
nowen | oh, yes | 17:35 |
nowen | that's right. We have a new intermediate ca. | 17:35 |
Guest64590 | i don't recall which build updated the int CA.. but i remember you saying we would need to re-generate the cert chain | 17:37 |
nowen | 3.5.0 b1421 updated the cert | 17:37 |
nowen | there's also an update for the utilities rpm. so you have to do both | 17:38 |
Guest64590 | ok | 17:40 |
*** Guest64590 has quit (Ping timeout: 272 seconds) | 20:06 | |
*** Troy_ (329b98a8@gateway/web/freenode/ip.50.155.152.168) has joined #wikid | 20:54 | |
*** Troy_ is now known as Guest34801 | 20:54 | |
nowen | Troy: any update? | 20:56 |
Guest34801 | Not really. I'm getting ready to setup the replication back on the old primary again | 21:08 |
Guest34801 | Nick: when i run the setup replication the other direction.. from the new primary back to the old primary (setup as slave). do I need to enter the root password to transfer the key? | 21:56 |
Guest34801 | I believe I already transferred the keys one time before in this direction | 21:56 |
nowen | yes | 21:56 |
nowen | really? | 21:57 |
nowen | oh, yeah, it is probably doing it again | 21:57 |
Guest34801 | yes.. when we had that last downtime | 21:57 |
nowen | I'm not sure, I always just enter it | 21:57 |
Guest34801 | ok.. i don't have it.. so I'll have to have the DC guys help me out | 21:57 |
nowen | hmm | 21:58 |
Guest34801 | Nick: I just tried the wikidctl sync from the new primary to the old primary and got some timeout error | 22:36 |
nowen | hmm | 22:36 |
nowen | can you ssh to the box? | 22:36 |
Guest34801 | but I was able to setup the replication in the reverse fine | 22:36 |
nowen | take a look at /opt/WiKID/private on both boxes - do you see the replication keys? | 22:37 |
Guest34801 | yes.. i was able to ssh as wikid user | 22:37 |
nowen | how about as root? | 22:37 |
Guest34801 | checking root | 22:41 |
nowen | I'm thinking that it is something with keys. it's just ssh, so it should be fast | 22:41 |
Guest34801 | yes.. i'm able to ssh as root both ways | 22:41 |
nowen | try again, maybe it was a blip | 22:42 |
Guest34801 | and I see the replication keys in the /private folder | 22:42 |
Guest34801 | i did test twice.. let me check a few other things first | 22:42 |
Guest34801 | what I did is the following: I stopped both servers, setup the slave (old primary) and the master (old slave).. then ran wikidctl sync selecting d for database only | 22:50 |
Guest34801 | didn't get any prompt to enter the root password of the slave | 22:51 |
Guest34801 | I see the replication.ssh in the private folder on each server | 22:51 |
nowen | it asked you for the password when you did the setup? | 22:51 |
nowen | do you see replication.ssh.pub too> | 22:52 |
nowen | ? | 22:52 |
Guest34801 | I got the While talking to slave server: Timeout connecting to xxx.xxx.xxx.xxx | 22:53 |
Guest34801 | everything looks fine otherwise | 22:53 |
Guest34801 | no.. it never asked me this time | 22:53 |
nowen | do you see replication.ssh.pub too? there should be two keys | 22:54 |
Guest34801 | yes.. i see that file on both servers | 22:54 |
nowen | dates and size are the same? | 22:54 |
Guest34801 | then i went ahead and started the slave and then the master | 22:54 |
Guest34801 | and replication seemed to start fine | 22:54 |
Guest34801 | i'm checking the db timestamps | 22:55 |
nowen | I'm checking here and it's syncing fine on mine | 22:55 |
nowen | check /var/log/secure on the new slave | 22:55 |
Guest34801 | time stamps are good | 22:56 |
Guest34801 | Jan 6 16:34:09 hsvwikidp1 sshd[15371]: Accepted publickey for root from x.x.x.x port 40675 ssh2 | 22:58 |
nowen | hmm, I know it's not your pipe. | 22:58 |
nowen | you think the timestamps mean that it worked, but failed at the close? | 22:59 |
Guest34801 | strange.. well.. we can leave it this way until tomorrow | 22:59 |
nowen | is the new primary still up? | 22:59 |
Guest34801 | yes.. we are working fine for now.... we would like to get back to the other way since we have the automatic failover scripts in place for the old primary | 23:00 |
Guest34801 | and from the timestamps i just ran, it looks like replication is working | 23:00 |
nowen | yeah, could be the ssh tunnel was slow to close or something | 23:01 |
nowen | ok - I'll be here tomorrow. | 23:25 |
*** nowen has quit (Quit: Leaving.) | 23:26 | |
*** Guest34801 has quit (Quit: Page closed) | 23:42 |
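
Editor's note: the key check discussed above (whether `replication.ssh` and `replication.ssh.pub` exist in `/opt/WiKID/private` on both boxes, and whether their dates and sizes match) could be scripted roughly as follows. This is only a sketch and not a WiKID tool: the function name `describe_keys` is hypothetical, and the SHA-256 digest is an addition for convenience. Whether the two servers are expected to hold byte-identical files is an assumption drawn from the "dates and size are the same?" question in the log.

```python
import hashlib
import os
import sys
from datetime import datetime, timezone

# Directory and file names as mentioned in the log above.
KEY_DIR = "/opt/WiKID/private"
KEY_FILES = ("replication.ssh", "replication.ssh.pub")


def describe_keys(key_dir=KEY_DIR, names=KEY_FILES):
    """Return {name: (size, mtime_utc_iso, sha256_hex)}, or None for a missing file.

    Hypothetical helper for illustration; run it on each server and compare output.
    """
    report = {}
    for name in names:
        path = os.path.join(key_dir, name)
        if not os.path.isfile(path):
            report[name] = None
            continue
        info = os.stat(path)
        with open(path, "rb") as handle:
            digest = hashlib.sha256(handle.read()).hexdigest()
        mtime = datetime.fromtimestamp(info.st_mtime, tz=timezone.utc).isoformat()
        report[name] = (info.st_size, mtime, digest)
    return report


if __name__ == "__main__":
    # Optional first argument overrides the key directory.
    directory = sys.argv[1] if len(sys.argv) > 1 else KEY_DIR
    for name, details in describe_keys(directory).items():
        print(name, "MISSING" if details is None else details)
```

Running the script as root on the master and on the slave and comparing the printed lines side by side would show a missing or mismatched key at a glance. It does not explain a connection timeout by itself, so `/var/log/secure` remains the place to look for that.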