07:18
Chris Nandor
anyone know how to fix this?
CRITICAL - found 1 databases with InnoDB tables. banjo has 1022976 kb free space left. (less than 1572864)
07:40
jboehm has joined this chat.
jboehm has left this chat.
jboehm has joined this chat.
07:49
Chris Nandor
[15:44:48] cnandor@slashdot-db-1 115 ~$ df -h
Filesystem                        Size  Used Avail Use% Mounted on
/dev/mapper/RootVG-RootVol00      773G  394G  340G  54% /
/dev/sda1                         244M   25M  207M  11% /boot
tmpfs                             7.9G  104K  7.9G   1% /dev/shm
filer-2-server-2.v22:/sd-general   99G   64G   35G  65% /usr/local
so it's not filesystem, it's MySQL's inno db i think
jmccarthy has joined this chat.
jmccarthy
on it
Chris Nandor
thanks jamie
CRITICAL - found 1 databases with InnoDB tables. banjo has 955392 kb free space left. (less than 1572864)
that was 15 minutes ago
jmccarthy
ok... looks like not a crisis yet, it still has a gig
netops' warnings are doing their job
Chris Nandor
yeah but we lost half a gig in the last ... 7 hours or so
and with people waking up ... dunno if it will accelerate
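For a rough sense of how long that leaves, here is a back-of-the-envelope sketch using the figures quoted above. It assumes the loss rate stays constant (which is exactly what is in doubt here) and reads "half a gig" as 512 MiB; the function name is made up for illustration.

```python
# Rough linear extrapolation of InnoDB free space, using figures from the chat.
# Assumption: the loss rate stays constant, which the chat itself doubts.

def hours_until_empty(free_kb: float, lost_kb: float, over_hours: float) -> float:
    """Hours left if space keeps shrinking at lost_kb per over_hours."""
    rate_kb_per_hour = lost_kb / over_hours
    return free_kb / rate_kb_per_hour

free_kb = 955392          # latest Nagios reading quoted above
lost_kb = 512 * 1024      # "half a gig", taken as 512 MiB (an approximation)
print(round(hours_until_empty(free_kb, lost_kb, 7), 1))
```

At that rate there would be roughly half a day of headroom, less if daytime traffic speeds things up.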
jmccarthy
yoiks
Chris Nandor
first warning was 4:56 ET
well ... 1:56 PT. might have been 3:56 ET.
whatever :)
so the plan is, indeed, to switch the master?
jmccarthy
probably. checking free on all dbs now
db-1: 927M
db-2: 14897M
db-3: 15110M
db-4(sphinx): 3419M
db-5(backup): 4579M
db-5(test): full, replication stopped 2 weeks ago
Chris Nandor
is that bad?
the replication stopped?
jmccarthy
nah
Chris Nandor
i don't know what "test" is
jmccarthy
the test db was mainly one I kept around "just in case"
Chris Nandor
ah ok
jmccarthy
and did sample queries of my own on, that kind of thing
Chris Nandor
and db-2 is the writer?
the second master?
jmccarthy
yes
Chris Nandor
almost completely unrelated, jamie, are you guys using the Jira stuff @ ThinkGeek?
all of engineering is apparently using it, except Slashdot
tracker stuff
jmccarthy
no, we have our own trac on a VPN based in Fairfax
ok so the plan is
we take the site static. move the writer *and* reader to db-2. take site dynamic again.
Chris Nandor
ok
what shall i do?
jmccarthy
thus buying us 14 more gigs of inno space at the cost of less hardware running stuff
Chris Nandor
and then what do we do? increase inno space on db-1?
jmccarthy
then I take db-1 down and give it more inno space, bring it back up, repoint readers from db-2 to db-{1,3}, bounce webheads
Chris Nandor
ok
why point away from db-3?
jmccarthy
because it replicates from db-1... when we stop db-1, -3 will rapidly get out of date
Chris Nandor
ah.
ok then!
tell me what to do if anything.
jmccarthy
hmmm
Chris Nandor
does sd-control -S fully take site static? do we restart webheads after doing that?
jmccarthy
yes, if you restart it does
let's do something else first
let's take db-2 out of reader rotation, shut it down, and increase ITS inno space first
Chris Nandor
ok
jmccarthy
I will give it 8G more
then I will give db-1 another 24G, so both writers have ~8G more than the one reader db-3, so next time this happens, if something does go horribly wrong and the nagios alerts aren't dealt with quickly enough, it will be db-3 that breaks badly first
Chris Nandor
hehe
jmccarthy
thus giving us warning and more time to fix the writers before *they* run out of inno, which is a much worse problem
Chris Nandor
yes
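The sizing logic can be checked with a quick sketch. It uses the free-space figures listed earlier in the chat and the additions as stated at this point, and it assumes "M" and "G" mean MiB and GiB and that all three servers consume InnoDB space at about the same rate. Neither assumption is confirmed in the chat.

```python
# Free InnoDB space in MiB, as listed earlier in the chat.
free_mib = {"db-1": 927, "db-2": 14897, "db-3": 15110}

# Planned additions in GiB as stated at this point in the chat
# (db-2 ends up getting 10G rather than 8G).
planned_gib = {"db-1": 24, "db-2": 8, "db-3": 0}

after = {db: free_mib[db] + planned_gib[db] * 1024 for db in free_mib}
first_to_fill = min(after, key=after.get)

for db, mib in after.items():
    margin = (mib - after["db-3"]) / 1024
    print(f"{db}: {mib} MiB free, {margin:+.1f} GiB vs db-3")
print("first to run out, all else equal:", first_to_fill)
```

Under those assumptions db-3 has the least headroom, so it would be the first to fill up and act as the early warning.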
jmccarthy
slashd is down. taking 2 out of rotation, which is:
Chris Nandor
replication breaking, bad. master breaking ... dammit.
jmccarthy
mysql> select * from dbs where virtual_user='slashdot02';
+----+--------------+---------+--------+--------+---------------+
| id | virtual_user | isalive | type   | weight | weight_adjust |
+----+--------------+---------+--------+--------+---------------+
| 12 | slashdot02   | yes     | reader |      2 |             1 |
+----+--------------+---------+--------+--------+---------------+
1 row in set (0.00 sec)
mysql> update dbs set isalive='no' where virtual_user='slashdot02';
Query OK, 1 row affected (0.00 sec)
Rows matched: 1 Changed: 1 Warnings: 0
(wait 5ish seconds. tada. 02 out of rotation.)
which I am confirming by checking 'show processlist;' on db-2 and watching the connections die off
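To illustrate the idea of the isalive flag, here is a toy model of the dbs table shown above. It uses an in-memory SQLite database, not MySQL. The reader03 row's id and weights, and the live_readers query itself, are hypothetical: the chat does not show how the web heads actually pick a reader.

```python
import sqlite3

# In-memory stand-in for the dbs table shown above (SQLite, not MySQL).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dbs (id INTEGER, virtual_user TEXT, isalive TEXT, "
    "type TEXT, weight REAL, weight_adjust REAL)"
)
conn.executemany(
    "INSERT INTO dbs VALUES (?, ?, ?, ?, ?, ?)",
    [
        (12, "slashdot02", "yes", "reader", 2, 1),  # row from the chat
        (13, "reader03", "yes", "reader", 1, 1),    # hypothetical values
    ],
)

def live_readers(conn):
    """Virtual users that would still be considered for reads (assumed rule)."""
    rows = conn.execute(
        "SELECT virtual_user FROM dbs "
        "WHERE type = 'reader' AND isalive = 'yes' ORDER BY id"
    )
    return [r[0] for r in rows]

print(live_readers(conn))
conn.execute("UPDATE dbs SET isalive = 'no' WHERE virtual_user = 'slashdot02'")
print(live_readers(conn))
```

After the update, only reader03 remains in the list, which is the effect the 'show processlist;' check is confirming on the real server.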
Chris Nandor
woo.
jmccarthy
11/01 11:12 sd-db-2:~$ ps auwwwwxf|grep mysqld
1870 24433 0.0 0.0 61156 704 pts/0 S+ 16:12 0:00 \_ grep mysqld
slashdot 17376 0.0 0.0 63828 1224 ? S Sep11 0:00 /bin/sh bin/mysqld_safe --defaults-file=/srv/mysql-etc/my.cnf
slashdot 17405 40.0 30.3 5530732 4985864 ? Sl Sep11 29817:39 \_ /srv/mysql-5.0.51a-linux-x86_64-icc-glibc23/bin/mysqld --defaults-file=/srv/mysql-etc/my.cnf --basedir=/srv/mysql-5.0.51a-linux-x86_64-icc-glibc23 --datadir=/srv/mysql-data --pid-file=/srv/mysql-data/42rwhf1.ch3.sourceforge.com.pid --skip-external-locking --port=3306 --socket=/srv/mysql-run/mysql.sock
(hope you're logging this :)
Chris Nandor
yeah.
thinking about putting it into a text file.
might as well. :)
jmccarthy
my ~/.my.cnf on db-2 has "socket=/srv/mysql-run/mysql.sock" which matches that --socket arg so I can just:
11/01 11:13 sd-db-2:~$ mysqladmin shutdown
11/01 11:13 sd-db-2:~$
I assume nagios is already bugging netops... cause they are going to get paged shortly
pudge if you can turn that off I would appreciate it
Chris Nandor
i will try
jmccarthy
going by the mysqld_safe line I do: susd vi /srv/mysql-etc/my.cnf
to the innodb_data_file_path param I append: ;ibdata41:2G;ibdata42:2G;ibdata43:2G;ibdata44:2G;ibdata45:2G
(ok I lied: 10G)
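The appended string is regular enough to generate rather than type by hand. A small sketch (the helper name is made up here) that reproduces the exact suffix above:

```python
def extra_data_files(first: int, count: int, size: str = "2G") -> str:
    """Build the ';ibdataN:SIZE' suffix for files first..first+count-1."""
    return "".join(f";ibdata{n}:{size}" for n in range(first, first + count))

suffix = extra_data_files(41, 5)
print(suffix)
```

Five files of 2G each is where the 10G total comes from.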
Chris Nandor
ok i think i disabled the checks
also on sd-db-1
jmccarthy
11/01 11:16 sd-db-2:/srv/mysql$ pwd
/srv/mysql
11/01 11:16 sd-db-2:/srv/mysql$ susd bin/mysqld_safe --defaults-file=/srv/mysql-etc/my.cnf
Starting mysqld daemon with databases from /srv/mysql-data
Chris Nandor
and sd-db-5 is low on space. 91% used. :)
what can i do to fix that?
i think i know something is deleted
but not sure what
jmccarthy
we might want to just give up on that unused 'test' db on db-5
presumably a few months of its binlogs are still lying around
Chris Nandor
just go in and drop database? :)
jmccarthy
hah not yet
gimme a sec
11/01 11:20 sd-db-2:~$ mysql banjo
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A
mysql> show table status like 'al2' \G
*************************** 1. row ***************************
Name: al2
Engine: InnoDB
Version: 7
Row_format: Redundant
Rows: 242409
Avg_row_length: 67
Data_length: 16302080
Max_data_length: 0
Index_length: 8749056
Data_free: 0
Auto_increment: NULL
Create_time: 2008-05-24 20:15:40
Update_time: NULL
Check_time: NULL
Collation: latin1_swedish_ci
Checksum: NULL
Create_options:
Comment: InnoDB free: 25237504 kB
1 row in set (0.04 sec)
mysql> slave start;
Query OK, 0 rows affected (0.01 sec)
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
few sec later:
Seconds_Behind_Master: 0
db-1:
mysql> update dbs set isalive='yes' where virtual_user='slashdot02';
Query OK, 1 row affected (0.00 sec)
Rows matched: 1 Changed: 1 Warnings: 0
show processlist confirms connections to -2.
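The check before re-enabling the reader (both replication threads running, zero lag) could be scripted along these lines. This is a sketch that parses pasted "Key: value" status text. The function names and the max_lag threshold are assumptions, not something the chat specifies.

```python
def parse_status(text: str) -> dict:
    """Turn 'Key: value' lines (vertical mysql output) into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields

def safe_to_enable(status: dict, max_lag: int = 0) -> bool:
    """True when both replication threads run and lag is within max_lag."""
    lag = status.get("Seconds_Behind_Master", "NULL")
    return (
        status.get("Slave_IO_Running") == "Yes"
        and status.get("Slave_SQL_Running") == "Yes"
        and lag.isdigit()
        and int(lag) <= max_lag
    )

sample = """
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Seconds_Behind_Master: 0
"""
print(safe_to_enable(parse_status(sample)))
```

A NULL lag, or either thread reporting anything other than Yes, makes the check fail, so the reader would stay out of rotation.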
ok now we xfer the master from 1 to 2 and turn off the other readers, then I take a break
actually you can handle this I think
Chris Nandor
so you gave more space to db-2?
jmccarthy
yes
Chris Nandor scratches his head
jmccarthy
you might as well do this to learn it :) so ... edit `pmpath DBIx::Password` to point the 'slashdot' user from db-1 to db-2
Chris Nandor
do we change the master in DBIx::Password and the dbs table?
nod
dbs table too?
or no?
jmccarthy
for now leave the 'slashdot02' user as a reader, and leave it active
set isalive='no' for the db-3 reader
in the dbs table
no need to edit DBIx::Password for that
then take site static, bounce webheads, take dynamic, bounce webheads again
site should come up with all db traffic, reads and writes, to db-2
Chris Nandor
ok DBIx::Password changed
i will now take static, kick webheads, take dynamic, kick webheads
we need to do static/dynamic in there?
jmccarthy
yes
if you just bounce webheads while dynamic, you'll get some writes to -1 and some to -2 and that can cause problems
Chris Nandor
nod
thought so, just wanted to hear it from you :)
08:30
Chris Nandor
kicking dynamic
jmccarthy
and remember to restart slashd :)
Chris Nandor
nod :)
ok we should be dynamic and live and slashd should be running
Chris Nandor waits for site to respond
Chris Nandor
there she goes
08:58
jmccarthy
we all good?
09:07
Chris Nandor
seems so
no problems noted
question
so when we switch to db-2 master, we go from even to odd, or odd to even, for some tables?
jmccarthy
odd to even for all tables
Chris Nandor
for those tables, is the next even/odd greater than the previous odd/even, or is it greater than the previous even/odd?
jmccarthy
the tables don't know whether they're on even or odd, each table has an attribute called AUTO_INCREMENT assigned to it which is the last assigned value
when mysql creates a new row it takes the smallest available value according to its (current) rules which is greater than that value
mysql> show create table comments \G
*************************** 1. row ***************************
Table: comments
Create Table: CREATE TABLE `comments` (
`sid` mediumint(8) unsigned NOT NULL default '0',
`cid` int(8) unsigned NOT NULL auto_increment,
[...]
KEY `date_sid` (`date`,`sid`)
) ENGINE=InnoDB AUTO_INCREMENT=29942449 DEFAULT CHARSET=latin1
the effect of this is that, across the switch, each table will increment by 1 and then go back to incrementing by 2's
Chris Nandor
ok, so max stays max
which can be broken if we are writing to both simultaneously
somewhat
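The "increment by 1, then back to 2's" behaviour can be simulated. This sketch assumes the odd/even split is done with an increment of 2 and offsets of 1 (db-1) and 2 (db-2), in the style of MySQL's auto_increment_increment and auto_increment_offset settings. The chat does not show the actual configuration, and the small id values are made up.

```python
def next_id(last: int, increment: int, offset: int) -> int:
    """Smallest value > last of the form offset + k * increment (k >= 0)."""
    candidate = last + 1
    if candidate <= offset:
        return offset
    steps = -(-(candidate - offset) // increment)  # ceiling division
    return offset + steps * increment

def take(last: int, n: int, increment: int, offset: int) -> list:
    """Generate the next n ids after last under the given rules."""
    ids = []
    for _ in range(n):
        last = next_id(last, increment, offset)
        ids.append(last)
    return ids

odd = take(0, 3, increment=2, offset=1)         # writes on the "odd" master
even = take(odd[-1], 3, increment=2, offset=2)  # after moving to the "even" master
print(odd, even)
```

The ids run 1, 3, 5 on the odd side, then 6, 8, 10 after the switch: one step of 1 across the switch, then steps of 2 again, and the maximum never goes backwards.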
ok lemme know if/when you wanna switch back
jmccarthy
I'm getting some lunch, can we reconvene here in an hour?
Chris Nandor
ok
see you then, thanks
09:20
precision has joined this chat.
precision
CRITICAL - found 1 databases with InnoDB tables. banjo has 893952 kb free space left. (less than 1572864)
on slashdot-db-1
jboehm
i think that's what jamie and pudge have been working on
precision
oh cool
09:26
precision has left this chat.
09:41
jmccarthy
heh, on db-1, these queries have been running for 4.6 days
*************************** 6. row ***************************
Id: 5775588
User: root
Host: localhost
db: banjo
Command: Query
Time: 401469
State: Copying to tmp table
Info: select count(*) from firehose_update_log_temp where uid in(select uid from firehose_update_log_temp where uid!=666 group by uid having max(total_num)>10) and total_num <= 10
*************************** 7. row ***************************
Id: 5775861
User: root
Host: localhost
db: banjo
Command: Query
Time: 401270
State: Copying to tmp table
Info: select count(*) from firehose_update_log_temp where uid not in(select uid from firehose_update_log_temp where uid!=666 group by uid having max(total_num)<=10) and total_num <= 10
Chris Nandor
heh
that is a long time, it seems to me.
10:01
tlord has joined this chat.
10:09
Chris Nandor
jamie, we gonna do stuff?
jboehm has left this chat.
10:13
jmccarthy
nod. gimme a few
10:23
jmccarthy
I think those queries are matching a ~1.5M row table against itself, for a total scan of 2295703764964 rows
each
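The numbers here can be sanity-checked with a couple of lines. The quoted total is a perfect square, which fits the "table against itself" reading, and the Time value from the processlist matches the 4.6 days mentioned earlier.

```python
import math

total_rows = 2295703764964   # total scan quoted in the chat
seconds_running = 401469     # Time column of the older query

table_rows = math.isqrt(total_rows)
print(table_rows, table_rows ** 2 == total_rows)
print(round(seconds_running / 86400, 1), "days")
```

That works out to a table of 1,515,158 rows, about 1.5M.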
ok, stopping sd-db-1, downing it, editing its config, starting it.
10:29
jmccarthy
hmmmm. actually
after killing those threads, now on db-1 I see significant "wait" states for the cpu
mysql> update dbs set isalive='no' where virtual_user='reader03';
Query OK, 1 row affected (0.00 sec)
Rows matched: 1 Changed: 1 Warnings: 0
10:47
jmccarthy
I'm not ecstatic about these wait states, I wonder if it's doing cleanup on those 2 threads
10:59
Chris Nandor
try restarting it ... ?
11:06
Chris Nandor
need me to do anything?
11:19
jmccarthy
nah
wait states gone. restarting mysqld on db-1
12:33
Chris Nandor
moved back to sd-1?
jmccarthy
no, writer is still 2
but I am pointing 'slashdot02' at db-1 to use as a reader
I know it's a little confusing to have { slashdot => 'db-2', slashdot02 => 'db-1' } but it doesn't hurt anything to leave it that way
bot
256bcaa..abb387b refs/heads/master (jmccarthy)
abb387b... Catch up mysql configs for sd-db-{1-4} with reality
Chris Nandor
yeah
whatever works
should i restart all the checks on nagios?
did you?
oh i see one check i didn't disable
jmccarthy
I did not. You should now.
12:45
jmccarthy
db-1 is bogging
12:49
jmccarthy
I disabled it to let the bog clear, then re-enabled it.
Hopefully its caches are hotter now and it will handle the load better.
For the record the db balancing task was (correctly) setting its (and db-3's, which replicates from it) weight to 0.01
but I think that task increases weights too quickly. If it did it more gradually, caches would probably warm up better.
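One way to picture "more gradually": cap how much the weight may grow per adjustment. This is only a sketch of the idea, not the real balancing task, whose code is not shown here. The doubling cap and the target weight of 1.0 are made-up values.

```python
def ramp_weights(start: float, target: float, max_factor: float = 2.0) -> list:
    """Weights from start up to target, growing at most max_factor per step."""
    if start <= 0 or max_factor <= 1:
        raise ValueError("start must be > 0 and max_factor > 1")
    weights = [start]
    while weights[-1] < target:
        weights.append(min(weights[-1] * max_factor, target))
    return weights

print(ramp_weights(0.01, 1.0))
```

With a cap of 2x per step, getting from 0.01 back to 1.0 takes seven adjustments instead of one jump, which would give the caches time to warm between steps.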