07:18

Chris Nandor

anyone know how to fix this?

CRITICAL - found 1 databases with InnoDB tables. banjo has 1022976 kb free space left. (less than 1572864)


07:40

jboehm has joined this chat.

jboehm has left this chat.

jboehm has joined this chat.

07:49

Chris Nandor

[15:44:48] cnandor@slashdot-db-1 115 ~$ df -h

Filesystem            Size  Used Avail Use% Mounted on

/dev/mapper/RootVG-RootVol00

                      773G  394G  340G  54% /

/dev/sda1             244M   25M  207M  11% /boot

tmpfs                 7.9G  104K  7.9G   1% /dev/shm

filer-2-server-2.v22:/sd-general

                       99G   64G   35G  65% /usr/local


so it's not the filesystem, it's MySQL's InnoDB space i think

jmccarthy has joined this chat.

jmccarthy

on it

Chris Nandor

thanks jamie

CRITICAL - found 1 databases with InnoDB tables. banjo has 955392 kb free space left. (less than 1572864)


that was 15 minutes ago

jmccarthy

ok... looks like not a crisis yet, it still has a gig

netops' warnings are doing their job

Chris Nandor

yeah but we lost half a gig in the last ... 7 hours or so

and with people waking up ... dunno if it will accelerate
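
A rough back-of-the-envelope for how long db-1 has left, assuming the loss rate stays at the "half a gig in 7 hours" mentioned above. The constant rate and the reading of "gig" as GiB are assumptions; the free-space figure is the one from the latest alert.

```python
# Rough time-to-exhaustion estimate for InnoDB free space on db-1.
# Assumptions (not measured): "half a gig" is taken as 0.5 GiB and the
# loss rate is treated as constant, which morning traffic may break.

KB_PER_GIB = 1024 * 1024


def hours_until_full(free_kb: float, lost_kb: float, over_hours: float) -> float:
    """Hours until free space reaches zero at a constant loss rate."""
    rate_kb_per_hour = lost_kb / over_hours
    return free_kb / rate_kb_per_hour


free_kb = 955392                # from the latest alert quoted above
lost_kb = 0.5 * KB_PER_GIB      # "half a gig" (assumed GiB)
remaining = hours_until_full(free_kb, lost_kb, over_hours=7)
print(f"about {remaining:.1f} hours left at the current rate")
```

If traffic picks up during the day the real figure would be lower, so this is only an upper-ish bound on the time available.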

jmccarthy

yoiks

Chris Nandor

first warning was 4:56 ET

well ... 1:56 PT.  might have been 3:56 ET.

whatever :)

so the plan is, indeed, to switch the master?

jmccarthy

probably. checking free on all dbs now

db-1: 927M

db-2: 14897M

db-3: 15110M

db-4(sphinx): 3419M

db-5(backup): 4579M

db-5(test): full, replication stopped 2 weeks ago

Chris Nandor

is that bad?

the replication stopped?

jmccarthy

nah

Chris Nandor

i don't know what "test" is

jmccarthy

the test db was mainly one I kept around "just in case"

Chris Nandor

ah ok

jmccarthy

and did sample queries of my own on, that kind of thing

Chris Nandor

and db-2 is the writer?

the second master?

jmccarthy

yes

Chris Nandor

almost completely unrelated, jamie, are you guys using the Jira stuff @ ThinkGeek?

all of engineering is apparently using it, except Slashdot

tracker stuff

jmccarthy

no, we have our own trac on a VPN based in Fairfax

ok so the plan is

we take the site static. move the writer *and* reader to db-2. take site dynamic again.

Chris Nandor

ok

what shall i do?

jmccarthy

thus buying us 14 more gigs of inno space at the cost of less hardware running stuff

Chris Nandor

and then what do we do?  increase inno space on db-1?

jmccarthy

then I take db-1 down and give it more inno space, bring it back up, repoint readers from db-2 to db-{1,3}, bounce webheads

Chris Nandor

ok

why point away from db-3?

jmccarthy

because it replicates from db-1... when we stop db-1, -3 will rapidly get out of date

Chris Nandor

ah.

ok then!

tell me what to do if anything.

jmccarthy

hmmm

Chris Nandor

does sd-control -S fully take site static?  do we restart webheads after doing that?

jmccarthy

yes, if you restart it does

let's do something else first

let's take db-2 out of reader rotation, shut it down, and increase ITS inno space first

Chris Nandor

ok

jmccarthy

I will give it 8G more

then I will give db-1 another 24G, so both writers have ~8G more than the one reader db-3, so next time this happens, if something does go horribly wrong and the nagios alerts aren't dealt with quickly enough, it will be db-3 that breaks badly first

Chris Nandor

hehe

jmccarthy

thus giving us warning and more time to fix the writers before *they* run out of inno, which is a much worse problem
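
The sizing plan above can be checked with a little arithmetic. This sketch uses the free-space figures reported earlier in the chat and the planned 8G and 24G additions; treating 1G as 1024M, and assuming all three servers consume InnoDB space at about the same rate, are assumptions.

```python
# Planned InnoDB free space after the resize, in MB.
# Free-space figures are the ones reported earlier in this chat;
# 1G == 1024M is an assumption.

free_mb = {"db-1": 927, "db-2": 14897, "db-3": 15110}
planned_extra_gb = {"db-1": 24, "db-2": 8, "db-3": 0}

after_mb = {db: free_mb[db] + planned_extra_gb[db] * 1024 for db in free_mb}

for db in ("db-1", "db-2"):
    margin_gb = (after_mb[db] - after_mb["db-3"]) / 1024
    print(f"{db}: {after_mb[db]}M free, {margin_gb:+.1f}G relative to db-3")

# The server with the least free space is the one expected to fill first.
first_to_fill = min(after_mb, key=after_mb.get)
print("first to run out:", first_to_fill)
```

Under those assumptions the reader db-3 ends up with the least headroom, which is the early-warning behavior the plan is after.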

Chris Nandor

yes

jmccarthy

slashd is down. taking 2 out of rotation, which is:

Chris Nandor

replication breaking, bad.  master breaking ... dammit.

jmccarthy

mysql> select * from dbs where virtual_user='slashdot02';

+----+--------------+---------+--------+--------+---------------+

| id | virtual_user | isalive | type   | weight | weight_adjust |

+----+--------------+---------+--------+--------+---------------+

| 12 | slashdot02   | yes     | reader |      2 |             1 | 

+----+--------------+---------+--------+--------+---------------+

1 row in set (0.00 sec)


mysql> update dbs set isalive='no' where virtual_user='slashdot02';

Query OK, 1 row affected (0.00 sec)

Rows matched: 1  Changed: 1  Warnings: 0

(wait 5ish seconds. tada. 02 out of rotation.)

which I am confirming by checking 'show processlist;' on db-2 and watching the connections die off

Chris Nandor

woo.

jmccarthy

11/01 11:12 sd-db-2:~$ ps auwwwwxf|grep mysqld

1870     24433  0.0  0.0  61156   704 pts/0    S+   16:12   0:00              \_ grep mysqld

slashdot 17376  0.0  0.0  63828  1224 ?        S    Sep11   0:00 /bin/sh bin/mysqld_safe --defaults-file=/srv/mysql-etc/my.cnf

slashdot 17405 40.0 30.3 5530732 4985864 ?     Sl   Sep11 29817:39  \_ /srv/mysql-5.0.51a-linux-x86_64-icc-glibc23/bin/mysqld --defaults-file=/srv/mysql-etc/my.cnf --basedir=/srv/mysql-5.0.51a-linux-x86_64-icc-glibc23 --datadir=/srv/mysql-data --pid-file=/srv/mysql-data/42rwhf1.ch3.sourceforge.com.pid --skip-external-locking --port=3306 --socket=/srv/mysql-run/mysql.sock

(hope you're logging this :)

Chris Nandor

yeah.

thinking about putting it into a text file.

might as well.  :)

jmccarthy

my ~/.my.cnf on db-2 has "socket=/srv/mysql-run/mysql.sock" which matches that --socket arg so I can just:

11/01 11:13 sd-db-2:~$ mysqladmin shutdown

11/01 11:13 sd-db-2:~$

I assume nagios is already bugging netops... cause they are going to get paged shortly

pudge if you can turn that off I would appreciate it

Chris Nandor

i will try

jmccarthy

going by the mysqld_safe line I do: susd vi /srv/mysql-etc/my.cnf

to the innodb_data_file_path param I append: ;ibdata41:2G;ibdata42:2G;ibdata43:2G;ibdata44:2G;ibdata45:2G

(ok I lied: 10G)
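
For reference, the appended fragment follows a regular pattern, so it can be generated rather than typed. A minimal sketch; the helper name `extra_data_files` is hypothetical, and the starting index 41 and 2G size are simply the values used above.

```python
def extra_data_files(first_index: int, count: int, size: str = "2G") -> str:
    """Build the text to append to an existing innodb_data_file_path value."""
    last_index = first_index + count
    return "".join(f";ibdata{i}:{size}" for i in range(first_index, last_index))


# Five more 2G files, ibdata41 through ibdata45, i.e. 10G in total.
fragment = extra_data_files(first_index=41, count=5)
print(fragment)
```

Note that if the existing value ends with an `autoextend` file, that attribute is only valid on the last file listed, so the line would need adjusting rather than a plain append.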

Chris Nandor

ok i think i disabled the checks

also on sd-db-1

jmccarthy

11/01 11:16 sd-db-2:/srv/mysql$ pwd

/srv/mysql

11/01 11:16 sd-db-2:/srv/mysql$ susd bin/mysqld_safe --defaults-file=/srv/mysql-etc/my.cnf

Starting mysqld daemon with databases from /srv/mysql-data

Chris Nandor

and sd-db-5 is low on space.  91% used.  :)

what can i do to fix that?

i think i know something is deleted

but not sure what

jmccarthy

we might want to just give up on that unused 'test' db on db-5

presumably a few months of its binlogs are still lying around

Chris Nandor

just go in and drop database?  :)

jmccarthy

hah not yet

gimme a sec

11/01 11:20 sd-db-2:~$ mysql banjo

Reading table information for completion of table and column names

You can turn off this feature to get a quicker startup with -A


mysql> show table status like 'al2' \G

*************************** 1. row ***************************

           Name: al2

         Engine: InnoDB

        Version: 7

     Row_format: Redundant

           Rows: 242409

 Avg_row_length: 67

    Data_length: 16302080

Max_data_length: 0

   Index_length: 8749056

      Data_free: 0

 Auto_increment: NULL

    Create_time: 2008-05-24 20:15:40

    Update_time: NULL

     Check_time: NULL

      Collation: latin1_swedish_ci

       Checksum: NULL

 Create_options: 

        Comment: InnoDB free: 25237504 kB

1 row in set (0.04 sec)


mysql> slave start;

Query OK, 0 rows affected (0.01 sec)


           Slave_IO_Running: Yes

          Slave_SQL_Running: Yes


few sec later:

      Seconds_Behind_Master: 0


db-1:

mysql> update dbs set isalive='yes' where virtual_user='slashdot02';

Query OK, 1 row affected (0.00 sec)

Rows matched: 1  Changed: 1  Warnings: 0


show processlist confirms connections to -2.

ok now we xfer the master from 1 to 2 and turn off the other readers, then I take a break

actually you can handle this I think

Chris Nandor

so you gave more space to db-2?

jmccarthy

yes

Chris Nandor scratches his head

jmccarthy

you might as well do this to learn it :) so ... edit `pmpath DBIx::Password` to point the 'slashdot' user from db-1 to db-2

Chris Nandor

do we change the master in DBIx::Password and the dbs table?

nod

dbs table too?

or no?

jmccarthy

for now leave the 'slashdot02' user as a reader, and leave it active

set isalive='no' for the db-3 reader

in the dbs table

no need to edit DBIx::Password for that

then take site static, bounce webheads, take dynamic, bounce webheads again

site should come up with all db traffic, reads and writes, to db-2

Chris Nandor

ok DBIx::Password changed

i will now take static, kick webheads, take dynamic, kick webheads

we need to do static/dynamic in there?

jmccarthy

yes

if you just bounce webheads while dynamic, you'll get some writes to -1 and some to -2 and that can cause problems

Chris Nandor

nod

thought so, just wanted to hear it from you :)

08:30

Chris Nandor

kicking dynamic

jmccarthy

and remember to restart slashd :)

Chris Nandor

nod :)

ok we should be dynamic and live and slashd should be running

Chris Nandor waits for site to respond

Chris Nandor

there she goes

08:58

jmccarthy

we all good?

09:07

Chris Nandor

seems so

no problems noted

question

so when we switch to db-2 master, we go from even to odd, or odd to even, for some tables?

jmccarthy

odd to even for all tables

Chris Nandor

for those tables, is the next even/odd greater than the previous odd/even, or is it greater than the previous even/odd?

jmccarthy

the tables don't know whether they're on even or odd, each table has an attribute called AUTO_INCREMENT assigned to it which is the last assigned value

when mysql creates a new row it takes the smallest available value according to its (current) rules which is greater than that value

mysql> show create table comments \G

*************************** 1. row ***************************

       Table: comments

Create Table: CREATE TABLE `comments` (

  `sid` mediumint(8) unsigned NOT NULL default '0',

  `cid` int(8) unsigned NOT NULL auto_increment,

[...]

  KEY `date_sid` (`date`,`sid`)

) ENGINE=InnoDB AUTO_INCREMENT=29942449 DEFAULT CHARSET=latin1


the effect of this is that, across the switch, each table will increment by 1 and then go back to incrementing by 2's
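
The odd-to-even switch described above can be simulated. This is a simplified model of the rule as stated in the chat (smallest allowed value greater than the last one), not MySQL's actual code. It assumes the even/odd split is done with the `auto_increment_increment` and `auto_increment_offset` server variables, which the log does not state; the starting value is borrowed from the `AUTO_INCREMENT` shown above purely for illustration.

```python
def next_id(last_assigned: int, increment: int, offset: int) -> int:
    """Smallest value > last_assigned of the form offset + k*increment, k >= 0."""
    if last_assigned < offset:
        return offset
    steps = (last_assigned - offset) // increment + 1
    return offset + steps * increment


last = 29942449  # an odd value, as left behind by the odd-numbered master
ids = []
for _ in range(3):
    # Writes now go to the even-numbered master (increment 2, offset 2).
    last = next_id(last, increment=2, offset=2)
    ids.append(last)
print(ids)
```

The first new id is one above the old maximum, and later ids step by 2, matching the "increment by 1 and then go back to incrementing by 2's" description.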

Chris Nandor

ok, so max stays max

which can be broken if we are writing to both simultaneously

somewhat

ok lemme know if/when you wanna switch back

jmccarthy

I'm getting some lunch, can we reconvene here in an hour?

Chris Nandor

ok

see you then, thanks

09:20

precision has joined this chat.

precision

CRITICAL - found 1 databases with InnoDB tables. banjo has 893952 kb free space left. (less than 1572864)


on slashdot-db-1

jboehm

i think that's what jamie and pudge have been working on

precision

oh cool

09:26

precision has left this chat.

09:41

jmccarthy

heh, on db-1, these queries have been running for 4.6 days

*************************** 6. row ***************************

     Id: 5775588

   User: root

   Host: localhost

     db: banjo

Command: Query

   Time: 401469

  State: Copying to tmp table

   Info: select count(*) from firehose_update_log_temp where uid in(select uid from firehose_update_log_temp where uid!=666 group by uid having max(total_num)>10) and total_num <= 10

*************************** 7. row ***************************

     Id: 5775861

   User: root

   Host: localhost

     db: banjo

Command: Query

   Time: 401270

  State: Copying to tmp table

   Info: select count(*) from firehose_update_log_temp where uid not in(select uid from firehose_update_log_temp where uid!=666 group by uid having max(total_num)<=10) and total_num <= 10


Chris Nandor

heh

that is a long time, it seems to me.

10:01

tlord has joined this chat.

10:09

Chris Nandor

jamie, we gonna do stuff?

jboehm has left this chat.

10:13

jmccarthy

nod. gimme a few 

10:23

jmccarthy

I think those queries are matching a 1.5M row table against itself, for a total scan of 2295703764964 rows

each

ok, stopping sd-db-1, downing it, editing its config, starting it.
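
A quick check of the self-join estimate quoted above, assuming the subquery is re-evaluated once per outer row so that the scan count is the row count squared. The quoted total is an exact square, which implies a table of 1,515,158 rows. The runtime uses the `Time` value from the processlist output.

```python
import math

total_rows_scanned = 2295703764964   # figure quoted above
table_rows = math.isqrt(total_rows_scanned)
is_perfect_square = table_rows * table_rows == total_rows_scanned

running_seconds = 401469             # Time column from the processlist output
running_days = running_seconds / 86400

print(table_rows, is_perfect_square, f"{running_days:.1f} days")
```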

10:29

jmccarthy

hmmmm. actually

after killing those threads, now on db-1 I see significant "wait" states for the cpu

mysql> update dbs set isalive='no' where virtual_user='reader03';

Query OK, 1 row affected (0.00 sec)

Rows matched: 1  Changed: 1  Warnings: 0

10:47

jmccarthy

I'm not ecstatic about these wait states, I wonder if it's doing cleanup on those 2 threads

10:59

Chris Nandor

try restarting it ... ?

11:06

Chris Nandor

need me to do anything?

11:19

jmccarthy

nah

wait states gone. restarting mysqld on db-1

12:33

Chris Nandor

moved back to sd-1?

jmccarthy

no, writer is still 2

but I am pointing 'slashdot02' at db-1 to use as a reader

I know it's a little confusing to have { slashdot => 'db-2', slashdot02 => 'db-1' } but it doesn't hurt anything to leave it that way

bot

256bcaa..abb387b refs/heads/master (jmccarthy)

abb387b... Catch up mysql configs for sd-db-{1-4} with reality

Chris Nandor

yeah

whatever works

should i restart all the checks on nagios?

did you?

oh i see one check i didn't disable

jmccarthy

I did not. You should now.

12:45

jmccarthy

db-1 is bogging

12:49

jmccarthy

I disabled it to let the bog clear, then re-enabled it.

Hopefully its caches are hotter now and it will handle the load better.

For the record the db balancing task was (correctly) setting its (and db-3's, which replicates from it) weight to 0.01

but I think that task increases weights too quickly. If it did it more gradually, caches would probably warm up better.