[ I know, I know, I'm still writing tech crap, just still working through it folks. Ill be back to gibberish one of these days. ]
OK, so here is a question I get:
Whats wrong with this?
CREATE TABLE IF NOT EXISTS `users_active_data` (
`id` int(128) NOT NULL auto_increment,
`users_id` int(128) NOT NULL,
`meta_key` varchar(32) NOT NULL,
`meta_group` varchar(128) NOT NULL,
`meta_data` varchar(128) NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 AUTO_INCREMENT=1 ;
INSERT INTO `users_active_data` (`id`, `users_id`, `meta_key`, `meta_group`, `meta_data`) VALUES
(41, 9, 'lost_fan', 'fools', '1'),
(29, 1, 'gender', 'f', '1'),
(39, 9, 'larper', 'acquiescence', '1'),
(48, 9, 'lost_fan', 'zombies', '1'),
(12, 1, 'age', '26-35', '0'),
(22, 1, 'larper', 'acquiescence', '0'),
(24, 1, 'larper', 'fools', '0'),
(23, 1, 'larper', 'phonies', '0'),
(25, 3, 'larper', 'dating', '1'),
(26, 3, 'lost_fan', 'zombies', '1'),
(27, 3, 'gender', 'm', '1'),
(28, 3, 'age', '26-35', '1'),
(30, 1, 'lost_fan', 'zombies', '1'),
(31, 1, 'larper', 'dating', '1'),
(40, 9, 'larper', 'dating', '1'),
(42, 9, 'gender', 'f', '1'),
(43, 9, 'age', '26-35', '1'),
(44, 9, 'age', '36-45', '1'),
(45, 9, 'age', '18-25', '1'),
(46, 9, 'age', '46-100', '1'),
(47, 9, 'gender', 'm', '1'),
(49, 9, 'larper', 'fools', '1'),
(50, 9, 'larper', 'phonies', '1');
SELECT users_id FROM users_active_data WHERE ((meta_key = 'gender' AND meta_group = 'm') ) AND ((meta_key = 'age' AND meta_group = '26-35') ) AND ((meta_key = 'lost_fan' AND meta_group = 'zombies') ) AND ((meta_key = 'larper' AND meta_group = 'acquiescence') OR (meta_key = 'larper' AND meta_group = 'fools') OR (meta_key = 'larper' AND meta_group = 'phonies') OR (meta_key = 'larper' AND meta_group = 'dating') )
(keys and groups changed to protect the app, i guess.)
Everything.
First off, read this:
http://en.wikipedia.org/wiki/Third_normal_form
(and maybe this: http://en.wikipedia.org/wiki/Larping )
Second, you are just doing too damn much with one table. There are a couple of patterns here:
a) You should have domain tables. The age and (to a lesser extent) gender, should just be ID’s that can be looked up either as a constant in code (GENDER_MALE=1, GENDER_FEMALE=2), or have a join table that lists these. A data centric model would prefer the latter, because it would be independent of the client code.
b) Similarly with the age range. It is much more space efficient (and query, and index and whatever else) to store a known value as something that can be looked up. That way, if you have a million users and want to change the age bracket of ’25-30′ to ’25 – 30′ you only update one row instead of a million. It also mitigates the risk of two different queries writing incompatible versions (like ’25-30′ or ’25 -30′ or having some dumb php problem that evaluates that as an expression and makes it equal -5.
c) key and value combinations are whats called an attribute table. This is the core of all that NOSQL stuff thats so popular today: storing a document as a series of key-value pairs. Thats what json/yaml basically are as well. XML, too, but its clunky to work with. Regardless, you need another table that just lists key and value pairs that is independent of the user and their personal data.
d) Some of these should be first class columns [what you call meta_data=0]. Attributes may seem appealing as an easy way to extend your model, but it comes at a cost. Even in a well tuned RDBMS, you have the cost of a join to the attribute table as well as iterating through the properties until you have a match. On the other hand, adding a required column to the primary (user) table means that you can retrieve that data directly when looking up a user without the cost of a join and a collection iteration (if you search every property and don’t find what you are looking for, then your access of that value is linear time vs. the number of attributes per user) .
e) Use innoDb and not MyISAM as a table type. InnoDB supports row locking and is much more transactionally safe, which means if things go wrong your database won’t get _AS_ fucked up. Also, with multiple writes to a single table, InnoDB will handle it better, since MyISAM locks the table instead of the row.
f) Use parameters in code. Don’t build strings of sql statements, or write a sanitize() method to try to strip out special characters. Prepare statement ($stmt = SELECT * from User where id=?) and call it (excute($stmt, 27)). It speeds up your queries and helps prevent sql injection attacks.
g) Always use a transaction: begin; ; if(everythings_ok) commit; else rollback; . Always. Always. Always. Any time you modify a prod DB with raw sql. Always. Always.
Third: Data is your key asset. Your database is what truly makes your application valuable. While it may seem like some extra work to properly organize and design your tables, it will save you untold hours with having easier sql queries, reliable data, and a data model that can be queried in ways other than your core application. Doing a proper data model means that you can write new programs to work with the data and handle complex queries to analyze what you have collected as hard data to improve your app. If you write a simple, robust application to acquire data, there is no end to the number of clients you can write to work with it.
It just has to be accessible and logical.
Finally: Script your database test setups. You don’t actually write to the database with raw sql, so why should your test data get that way. Ideally, you would have modular functions and a test harness that can import your data directly through the functions that your application uses. For example, if you have a php script called register_account.php, that takes a bunch of parameters to register an account, then make that script only handle parsing parameters, passing them to a function that handles the actual transaction, and then shows the page. PHP doesn’t really encourage this kind of programming, unfortunately.
So: register_account parses parameters, and then calls db_layer.php#write_user(%userdata), which actually does the database write. If you set it up this way, you can make a script called create_user_data.php, that just calls the same method with a bunch of mock objects and sets up the DB. Its better to use a DB like sqlite because it is quick and easy to work with the schema before you commit to a full on RDBMS.
That being said, ill try to scrap together a quick model to show you what I am talking about.
(see part 2 tomorrow).