*** joevano has quit (*.net *.split) | 03:52 | |
*** joevano (~joevano@bzflag/developer/JoeVano) has joined #wikid | 03:53 | |
*** nowen (~nowen@50-194-249-125-static.hfc.comcastbusiness.net) has joined #wikid | 13:11 | |
*** tessellare (~tessellar@38.88.11.237) has joined #wikid | 13:25 | |
*** tessellare has parted #wikid (None) | 13:25 | |
*** mark___ (8f74745b@gateway/web/freenode/ip.143.116.116.91) has joined #wikid | 13:51 | |
mark___ | Nick are you here | 13:51 |
---|---|---|
mark___ | we have an issue with our wikid | 13:52 |
nowen | is it the iphone token? | 13:52 |
mark___ | no | 13:52 |
mark___ | our db crashed | 13:52 |
nowen | what happened? | 13:53 |
nowen | can you restart wikid? | 13:53 |
mark___ | our db filled up the drive and now over 4500 of the users are missing | 13:53 |
mark___ | no it will not let us restart | 13:53 |
mark___ | gives us a space error | 13:54 |
nowen | ok | 13:54 |
nowen | can you run some commands now? | 13:54 |
mark___ | sure Troy is joining so hang on one moment | 13:55 |
*** Troy (329b9bb1@gateway/web/freenode/ip.50.155.155.177) has joined #wikid | 13:55 | |
Troy | Hi | 13:56 |
nowen | hi | 13:56 |
nowen | sorry to hear about the issue | 13:56 |
nowen | i have some postgres commands I want you to run | 13:56 |
nowen | is the disk 100% full? | 13:56 |
nowen | run 'yum clean all' first to see if that helps | 13:56 |
mark___ | now we have 9% free | 13:58 |
nowen | ok | 13:58 |
nowen | now: su - postgres | 13:58 |
nowen | $ psql -d wikid | 13:58 |
nowen | wikid=# SELECT pg_size_pretty(pg_database_size('wikid')) As fulldbsize; | 13:59 |
mark___ | should we start the server again? | 13:59 |
nowen | I pasted the prompt there too | 13:59 |
nowen | no, not yet | 13:59 |
nowen | that command will show the db size | 13:59 |
nowen | actually, right, you might need to start postgres | 14:00 |
nowen | 'service postgresql start' i think | 14:00 |
nowen | no need to start all of wikid yet | 14:00 |
nowen | we will vacuum the db and that should get us more space | 14:00 |
mark___ | what is the command for starting postgres | 14:01 |
nowen | service postgresql start | 14:01 |
Troy | says service command not found | 14:02 |
mark___ | command service not found | 14:02 |
nowen | is this ubuntu? | 14:03 |
nowen | or centos? | 14:03 |
nowen | are you root? | 14:04 |
Troy | redhat | 14:04 |
Troy | postgres -D /usr/local/pgsql/data | 14:04 |
Troy | i'm su to postgres | 14:04 |
nowen | oh - sorry - back out to root and start it | 14:04 |
nowen | once started, su to postgres again | 14:06 |
nowen | are y'all in replication mode? | 14:06 |
mark___ | okay it started | 14:08 |
Troy | yes | 14:08 |
Troy | should I config back to standalone? | 14:09 |
nowen | not yet, I think | 14:09 |
Troy | ok | 14:10 |
nowen | as postgres, if you run 'psql -d wikid -p 5434' do you get a psql prompt? | 14:11 |
Troy | yes | 14:13 |
nowen | ok | 14:13 |
nowen | run 'SELECT pg_size_pretty(pg_database_size('wikid')) As fulldbsize;' | 14:14 |
nowen | and tell me how big it is | 14:14 |
Troy | it didn't come back with anything after that command | 14:15 |
Troy | does it dump to a file? | 14:15 |
nowen | no it should come back | 14:16 |
nowen | you have a ; on the end? | 14:16 |
nowen | and you dropped the quotes? | 14:16 |
nowen | wikid=# SELECT pg_size_pretty(pg_database_size('wikid')) As fulldbsize; | 14:17 |
nowen | fulldbsize | 14:17 |
nowen | ------------ | 14:17 |
nowen | 15 MB | 14:17 |
nowen | (1 row) | 14:17 |
Troy | psql -d wikid -p 5432 or 5434 ? | 14:17 |
Troy | how do I exit and try the other port? | 14:17 |
nowen | ctrl d | 14:18 |
nowen | should be 5434 in replication | 14:18 |
Troy | yes.. doesn't connect on that other port | 14:19 |
Troy | just doesn't bring back anything for the db size | 14:19 |
Troy | yes.. i dropped the quotes | 14:19 |
nowen | ko | 14:19 |
nowen | ok | 14:20 |
nowen | so, back as postgres | 14:20 |
Troy | -rw------- 1 root root 0 Oct 27 04:02 spooler.2 | 14:20 |
Troy | -rw------- 1 root root 0 Oct 20 04:02 spooler.3 | 14:20 |
Troy | -rw------- 1 root root 0 Oct 13 04:02 spooler.4 | 14:20 |
Troy | drwxr-x--- 2 squid squid 4096 Feb 16 2010 squid | 14:20 |
Troy | -rw------- 1 root root 0 Sep 13 2011 tallylog | 14:20 |
Troy | -rw-r--r-- 1 root root 16952 Nov 15 07:07 up2date | 14:20 |
Troy | -rw-r--r-- 1 root root 21646 Nov 10 02:49 up2date.1 | 14:20 |
Troy | -rw-r--r-- 1 root root 21906 Nov 3 03:06 u | 14:20 |
Troy | ok | 14:20 |
Troy | wikid=# SELECT pg_size_pretty(pg_database_size('wikid')) As fulldbsize | 14:21 |
Troy | wikid-# SELECT pg_size_pretty(pg_database_size('wikid')) As fulldbsize; | 14:21 |
nowen | did it work? | 14:22 |
nowen | SELECT pg_size_pretty(pg_database_size('wikid')) As fulldbsize | 14:22 |
nowen | wikid-# SELECT pg_size_pretty(pg_database_size('wikid')) As fulldbsize; | 14:22 |
nowen | SELECT pg_size_pretty(pg_database_size('wikid')) As fulldbsize; | 14:24 |
nowen | ? | 14:26 |
Troy | ERROR: syntax error at or near "SELECT" at character 64 | 14:27 |
Troy | LINE 2: SELECT pg_size_pretty(pg_database_size('wikid')) As fulldbsi... | 14:27 |
nowen | hmm | 14:27 |
nowen | maybe something in the cache. try again | 14:27 |
nowen | is there a leading space? | 14:27 |
nowen | nah, that doesn't matter | 14:27 |
nowen | ok - we can skip this if you like. just vacuum the db and see if it helps the disk | 14:29 |
Troy | ok | 14:31 |
Troy | the problem is when it was up it was only showing like 500 users | 14:31 |
mark___ | will that remove any critical data | 14:31 |
nowen | It should not | 14:32 |
nowen | are you sure it is the db taking the space? | 14:32 |
Troy | not sure really..because the secondary server was not full | 14:33 |
Troy | just this one | 14:33 |
Troy | could be logs? | 14:33 |
nowen | was the server running slow? | 14:33 |
Troy | a little.. but mostly just when I restarted. i got the timecop errors | 14:33 |
Troy | then the log in wikid said it was out of space | 14:34 |
nowen | did you have the logs set for debug? | 14:34 |
Troy | no.. it was normal | 14:35 |
nowen | look for any big logs in /var/log | 14:35 |
nowen | also 'locate *.rpm' | 14:35 |
Troy | opt/wikid-utilities-3.0.9-1.i386.rpm | 14:39 |
Troy | /usr/share/doc/mozldap-6.0.5/README.rpm | 14:39 |
Troy | /usr/share/doc/vim-common-7.0.109/Changelog.rpm | 14:39 |
Troy | /var/cache/yum/rhel-i386-server-5/packages/ghostscript-8.70-6.el5.i386.rpm | 14:39 |
Troy | can I check maybe if we have a backup of the db? | 14:40 |
nowen | that's ok. just wanted to make sure that there weren't any big wikid rpms | 14:40 |
nowen | yes - and we can create one | 14:40 |
Troy | then run a vacuum? | 14:40 |
nowen | well - your secondary should be a back up | 14:41 |
Troy | yes.. unfortunately the secondary is the same | 14:44 |
nowen | 'tar -czvf dbbackup.tar.gz /var/lib/pgsql/data/* ' will tar up the db. | 14:44 |
nowen | but I worry there isn't room for it on the server | 14:45 |
nowen | how much space is there? | 14:46 |
Troy | we are restoring from a backup | 14:46 |
*** jrdx (~jrdx@38.88.11.237) has joined #wikid | 14:50 | |
*** jrdx has parted #wikid (None) | 14:50 | |
nowen | hmm - I was going to suggest we try to start it and see | 14:53 |
Troy | just an update.. we are restoring from a full backup from 11/10, then we have daily incrementals | 15:01 |
nowen | so, are your users all back? | 15:06 |
Troy | we are still waiting on the restore | 15:07 |
Troy | # psql -d wikid -p 5432 or 5434 | 15:07 |
Troy | wikid-# SELECT pg_size_pretty(pg_database_size('wikid')) As fulldbsize; | 15:07 |
Troy | wikid=# VACUUM FULL VERBOSE; | 15:07 |
Troy | should I go ahead and run a VACUUM when it's back online? | 15:08 |
Troy | I will check first.. but we may run into the issue again if it's close to the max usage | 15:08 |
nowen | if you have a back up, then no harm in running vacuum, certainly | 15:08 |
nowen | 'vacuumdb -avfv -p 5434' | 15:09 |
nowen | and 'reindexdb -p 5434' | 15:10 |
mark___ | nick iphone | 15:13 |
mark___ | is the app missing from the store? | 15:13 |
mark___ | for iphones? | 15:13 |
nowen | mark___: yes. we had a show-stopper bug. no option to revert the binaries on the apple store, so we had to pull it. we have submitted a critical fix and await apple | 15:14 |
mark___ | ok | 15:14 |
nowen | very sorry | 15:14 |
nowen | mark___: how big are the disks on your wikid server? | 15:56 |
Troy | i think there was only 3GB allocated for the /var volume | 16:11 |
nowen | I would really like to see the output of the dbsize | 16:12 |
nowen | what's the current status? | 16:21 |
Troy | we are back up | 17:03 |
nowen | good to hear | 17:03 |
nowen | in /var/lib/pgsql, will you run 'du -h' | 17:04 |
nowen | and let me know how big /data is? | 17:05 |
mark___ | hi Nick | 17:19 |
Troy | 672K | 17:19 |
mark___ | do you know how big the flat file is | 17:19 |
mark___ | for a failover | 17:19 |
nowen | mark___: i don't follow. what flat file? | 17:19 |
mark___ | does the failover use postgres | 17:19 |
mark___ | when wikid fails over | 17:20 |
nowen | replication uses postgres | 17:20 |
mark___ | how does that process occur | 17:20 |
nowen | when the primary is down, you need to run wikidctl promote on the secondary | 17:21 |
mark___ | yes but what does that entail | 17:22 |
mark___ | when i run wikidctl | 17:22 |
mark___ | what all occurs | 17:22 |
mark___ | do the logs move over as well | 17:22 |
nowen | no, everything should be there. You give the secondary an ip that works and run it as the primary | 17:23 |
nowen | http://www.wikidsystems.com/support/wikid-support-center/installation-how-tos/how-to-configure-wikid-for-replication has the commands | 17:23 |
nowen | Troy: that doesn't seem very big. how much free disk space is there now? | 17:24 |
nowen | mark___: it can be scripted. is that what you want to do? | 17:30 |
mark___ | Nick here is our question | 17:52 |
mark___ | the primary server failed | 17:52 |
mark___ | due to lack of db space | 17:53 |
mark___ | when we failed over the secondary did not work | 17:53 |
mark___ | or did not come up correctly | 17:53 |
mark___ | any ideas as to why this did not work as expected | 17:54 |
Troy | do the secondary server's flat files get updated on every db change on the primary? | 17:54 |
mark___ | and we have it scripted | 17:54 |
nowen | ah - I see - by flat-file, you mean not in the db. No, those must be synced by 'wikidctl sync'. But, those should not change often. E.g., the intermediate certs. | 17:56 |
nowen | how exactly did the secondary 'not work'? not able to get OTPs? or not able to login? | 18:04 |
mark___ | it did not show but 786 users | 18:24 |
mark___ | when it should have shown 5726 | 18:24 |
nowen | are they both up now? We can check to see if they are in sync | 18:25 |
Troy | yes.. it should've shown the full amount.. but somehow it pulled in something else | 18:25 |
Troy | we are running just the primary in stand alone | 18:25 |
nowen | do you think it pulled in the mistaken amount from the primary? | 18:26 |
Troy | that's what I'm thinking | 18:27 |
nowen | so, we're back to what happened on the primary | 18:28 |
nowen | how much disk space is there now? | 18:28 |
Troy | they increased the volume.. i'll have to check the space now | 18:28 |
Troy | looks like they increased /var from 3GB to 5GB | 18:31 |
nowen | is it close to 3gb? | 18:37 |
nowen | 'df -h' | 18:37 |
nowen | I never know what people know ;-) | 18:37 |
*** mark___ has quit (Ping timeout: 250 seconds) | 18:44 | |
*** mark___ (8f74745b@gateway/web/freenode/ip.143.116.116.91) has joined #wikid | 19:18 | |
mark___ | Nick | 19:18 |
mark___ | our DBA has a question | 19:18 |
mark___ | we started everything back up | 19:18 |
nowen | ok | 19:18 |
mark___ | and he can no longer access the DB | 19:18 |
mark___ | and he thinks it is down | 19:19 |
mark___ | however wikid is working | 19:19 |
nowen | is he using the right port? | 19:19 |
mark___ | what port | 19:19 |
nowen | no replication == 5432 | 19:19 |
nowen | replication == 5434 | 19:19 |
mark___ | Nick | 19:36 |
nowen | yes | 19:37 |
mark___ | one sec | 19:37 |
nowen | k | 19:37 |
Troy | Nick .. what is the path to the postgres flat file that gets sent over to the secondary? | 20:38 |
nowen | umm. there is no flat file for postgres. there's a utility that copies all the transactions in real time to the secondary | 20:39 |
nowen | if you would like to check that the timestamps are the same on the primary and secondary, you can run this: psql -h 127.0.0.1 -p 5434 -d wikid -U postgres -c "select max(timestamp) from logging_event" | 20:42 |
nowen | on the master, you can check the secondary using port 5436 instead of 5434 | 20:43 |
mark___ | is there a way to create a lag or holding folder before being applied in the event there was a corruption | 20:49 |
mark___ | just in case a corruption did occur it would not affect the secondary immediately? | 20:50 |
nowen | good question, let me dig on that | 20:50 |
mark___ | ok | 20:50 |
nowen | doesn't appear to be an option | 20:55 |
nowen | not looking good on that | 21:14 |
nowen | you could run a cron job on the secondary that tars up the db | 21:15 |
nowen | any idea what caused the corruption? was it the disk space? | 21:22 |
Troy | i guess nothing was corrupted.. just ran out of space and once the volume was increased, the db was brought back online | 21:44 |
nowen | hmm | 21:45 |
nowen | my question: how much space is there now? did something happen to chew it all up? or was it a slow roll? | 21:45 |
Troy | we did add quite a few more users over the last few months.. so i think it was just a slow roll | 21:46 |
Troy | we thought Nagios was monitoring the volumes.. but I guess that was never setup | 21:47 |
Troy | we've been faithful to archive the logs every 2 weeks | 21:47 |
nowen | how much space is there now? | 21:48 |
Troy | brb.. i need to run pick up my kids from school.. | 21:48 |
Troy | i think we have about 2GB free on the db volume | 21:48 |
nowen | are we cool for today? | 21:48 |
Troy | yea.. i think so.. i want to follow up with you next week if you are available Monday or Tuesday | 21:48 |
nowen | sure | 21:48 |
Troy | just in case any questions come up.. thanks! | 21:49 |
nowen | I'd like to keep a close eye on things for a bit | 21:49 |
nowen | email me if I'm not here, as always | 21:49 |
nowen | mark___: It's possible that I am being stupid on that quote too. long couple of days | 21:51 |
nowen | mark___: are we ok for the night? my eldest is in a play - I saw it last night, but would like to catch it again | 22:27 |
Troy | yes.. we are good for now | 22:28 |
nowen | ok | 22:28 |
nowen | sorry for the issue. | 22:29 |
*** nowen has quit (Quit: Leaving.) | 22:30 |
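
The session ends with the finding that the volume filled up slowly and that nothing was monitoring it. A minimal sketch of a check that could be run from cron is below. It assumes a POSIX shell and `df -P`; the function names `usage_percent` and `check_volume` and the 90% limit are hypothetical choices for illustration, not part of WiKID or Nagios.

```shell
#!/bin/sh
# Hypothetical helpers (not WiKID tooling): warn when a volume is nearly full.

# usage_percent PATH
# Prints the integer percent used on the filesystem that holds PATH.
usage_percent() {
    # df -P prints one line per filesystem; field 5 is the capacity, e.g. "42%".
    df -P "$1" | awk 'NR==2 { sub(/%/, "", $5); print $5 }'
}

# check_volume PATH LIMIT
# Returns 0 if usage is below LIMIT percent, 1 if at or above it,
# and 2 if the usage could not be read.
check_volume() {
    used=$(usage_percent "$1")
    [ -n "$used" ] || return 2
    if [ "$used" -ge "$2" ]; then
        echo "WARNING: $1 is ${used}% full (limit $2%)"
        return 1
    fi
    echo "OK: $1 is ${used}% full"
    return 0
}
```

For example, `check_volume /var 90` could be called from a cron script that mails the output whenever the return code is non-zero. The path and threshold would need to match the volume that actually holds the database.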