Scrame

May 31, 2010

Daily#147: May — tl;dr

Filed under: Uncategorized — scrame @ 8:35 pm

Eh, just as an update: A couple bitter entries. Steroids. Also, some java rants, and bay to breakers. Search for those. Of course the iPad will become the premier games platform.. Side notes on protocol buffers. Marble Hornets. And a multipart discussion about database schema.

Pretty tech heavy, but april was centered almost completely around script frenzy, so there was a bit to catch up on. Still have a few things on the backlog for next month, though.

5 months. Haters, y’all know what to suck.

May 30, 2010

Daily #146: Basic Schema follow-up.

Filed under: Technology,daily — scrame @ 4:27 pm

[This is the end of the post, mostly now just covers handling lists and querying to extract ids from property tables without the necessarily keying the main table.]

So, in your programs data model, try to think about how your user data should be structured.

First class data:
string name
char gender
string IP
The collection data:
Map attributes
List excludes
etc

The attributes table handles single properties that you can add to a user but are not mandatory, like cell_phone_number, fax_number, eye_color, whatever. That is a model of a map. Typically with attributes and properties, you really just need the one.

For a list of specific values, you need anther table to hold the list of users to exclude.

user_exclude, that just has 2 columns.
int user_id, int exclude_id

So then a list of ID’s the user doesn’t want to talk to is pulled in like so:

select exclude_id from user_exclude where id=?

So, create an index on that table for both user_id, and (user_id, exclude_id), so you will be able to search for the whole list, or for an exact match, respectively.

You can add additional lists, of course, they just need to follow the same pattern.

So consider the same thing for “interests”:

Table interests
int id
string name
auto_index

insert into interests (name) values
(looking),
(fun),
(business),
etc.

Then a table user_interests:
int user_id
int interest_id

set up similar indicies as above. (one for id, one for both). Also make a unique key of the row.

So now we are looking at two users:

id:1
name: User a:
gender: ‘M’
interests: looking, fun

- and -

id:2
name: User b:
gender ‘F’
interests: business, fun

So then try this:

//get a list of unique user ids from the interests table
select distinct ui2.id as match from user_interests ui1, user_interests ui2, user_excludes ue where

//the user that is connecting (the question mark is a parameter)
ui1 = ? AND

//the selected user is not the user where are querying
ui1.id <> ui2.id AND

//inner join against the excludes table:
//so the returned id will never be someting on the users exclude list
ui1.id = ue.user_id AND
ui2.id != ue_exclude_id AND

//and at least one property matches
ui1.interest_id = ui2.interest_id

That will return a list of users with at least one interest in common. This only matches on interests, but user could be joined back and sorted with a couple extra clauses:

Amend the first line to read:
select dustinct uid2.id from user u, user_attribute uid1 … etc
then with the rest of the wheres add:

AND
u.id = ui2.id AND
u.gender = ‘F’ AND
u.whatever = ‘bar’
etc…

You can make stricter matches on the intesrest (or whatever) lists, but if you are going to put real procedural logic, you should write a service to handle the matching, that just keeps all the online users in memory. It will be faster, easier to maintain and wont require as much database overhead.

Ex: Your initial table was MyISAM, if you ran this query against that database, which locks on write, then this service would hang anytime someone registers, or you need to write data to one of these tables. A separate service could eliminate that kind of deadlock, but with real-time/concurrent systems there are a few others to worry about.

[The end of it -- done offline was a suggestion that they write a service that just handles this stuff in memory instead of leaning on database / set theory to match users up. Maybe ill tackle that tomorrow.]

May 29, 2010

Daily #145: Continued.

Filed under: daily — scrame @ 4:23 pm

This was mostly done in meetings. have some follow-up notes. Ill post later.

May 28, 2010

Daily #144: Third Normal Form (minus one or two).

Filed under: Technology,daily — scrame @ 2:31 am

So the last entry was mostly a rant. After that, I did a quick sketch of a database that might address the initial problem. Here is a revised map.

From the initial suggestions, I had a separate gender table, but that was over-done. I moved that to a single character column on the user table. Genders aren’t really a domain that gets extended, and usually a simple M or F suffices, unless you are making a new kind of social network.

You can get the sql here.

Sorry about the weird URL.

The above schema was drawn with sql designer

You can load the project xml here

So, a quick run down of the tables:

user: carries first class user data. This example has name (varchar(255)) and gender(char(1)) as first class columns. Ideally, you should have a minimum of non-nullable columns, and values that represent common domains (age range, zip code, state, salutation, etc)

age_range: a domain table that maps an int (ID) to a string that represents and age range.

user_attributes: name-value pairs keyed to a specific user. The table can be queried directly if you know the user ID. Attributes can be queried through joins, but no extra code processing.

user_document: a table that keys a user ID to a text blob. This is an optional table, that if a row matching a user exists, returns a text blob containing a document (JSON, XML, binary, etc) that can be parsed by the calling program. It is impossible to query for specific attributes, but it is useful for containing secondary information, but is not required of any user.

With all of that structure, the task is up to you to determine what information is primary (a first-class column for users), what is secondary (an attribute that can be directly queried through a join, or a property that can be searched for from a more general query), and if you even need a document column (if you are collecting arbitrary data that it inconsistent enough to not deserve a secondary attribute, or a piece of data too large to fit in a varchar(255) ).

With sqk designer, the layout xml and raw sql, you should at least be able to set up a database with the tables. Bonus points if you can automate dropping and creating the tables from the schema.

(Stay tuned for part 3: data/queries)

May 27, 2010

Daily #143: A brief break.

Filed under: Errata,daily — scrame @ 9:11 pm

I have a bit more from the work i did last night, but its not quite ready to go. Im going to try to push it up in the morning.

May 26, 2010

Daily #142: Third Normal Form [part 1]

Filed under: Technology,daily — scrame @ 11:00 pm

[ I know, I know, I'm still writing tech crap, just still working through it folks. Ill be back to gibberish one of these days. ]

OK, so here is a question I get:

Whats wrong with this?


CREATE TABLE IF NOT EXISTS `users_active_data` (
  `id` int(128) NOT NULL auto_increment,
  `users_id` int(128) NOT NULL,
  `meta_key` varchar(32) NOT NULL,
  `meta_group` varchar(128) NOT NULL,
  `meta_data` varchar(128) NOT NULL,
  PRIMARY KEY  (`id`)
) ENGINE=MyISAM  DEFAULT CHARSET=latin1 AUTO_INCREMENT=1 ;

INSERT INTO `users_active_data` (`id`, `users_id`, `meta_key`, `meta_group`, `meta_data`) VALUES
(41, 9, 'lost_fan', 'fools', '1'),
(29, 1, 'gender', 'f', '1'),
(39, 9, 'larper', 'acquiescence', '1'),
(48, 9, 'lost_fan', 'zombies', '1'),
(12, 1, 'age', '26-35', '0'),
(22, 1, 'larper', 'acquiescence', '0'),
(24, 1, 'larper', 'fools', '0'),
(23, 1, 'larper', 'phonies', '0'),
(25, 3, 'larper', 'dating', '1'),
(26, 3, 'lost_fan', 'zombies', '1'),
(27, 3, 'gender', 'm', '1'),
(28, 3, 'age', '26-35', '1'),
(30, 1, 'lost_fan', 'zombies', '1'),
(31, 1, 'larper', 'dating', '1'),
(40, 9, 'larper', 'dating', '1'),
(42, 9, 'gender', 'f', '1'),
(43, 9, 'age', '26-35', '1'),
(44, 9, 'age', '36-45', '1'),
(45, 9, 'age', '18-25', '1'),
(46, 9, 'age', '46-100', '1'),
(47, 9, 'gender', 'm', '1'),
(49, 9, 'larper', 'fools', '1'),
(50, 9, 'larper', 'phonies', '1');

SELECT users_id FROM users_active_data WHERE ((meta_key = 'gender' AND meta_group = 'm') ) AND ((meta_key = 'age' AND meta_group = '26-35') ) AND ((meta_key = 'lost_fan' AND meta_group = 'zombies') ) AND ((meta_key = 'larper' AND meta_group = 'acquiescence') OR (meta_key = 'larper' AND meta_group = 'fools') OR (meta_key = 'larper' AND meta_group = 'phonies') OR (meta_key = 'larper' AND meta_group = 'dating') )

(keys and groups changed to protect the app, i guess.)

Everything.

First off, read this:

http://en.wikipedia.org/wiki/Third_normal_form

(and maybe this: http://en.wikipedia.org/wiki/Larping )

Second, you are just doing too damn much with one table. There are a couple of patterns here:

a) You should have domain tables. The age and (to a lesser extent) gender, should just be ID’s that can be looked up either as a constant in code (GENDER_MALE=1, GENDER_FEMALE=2), or have a join table that lists these. A data centric model would prefer the latter, because it would be independent of the client code.

b) Similarly with the age range. It is much more space efficient (and query, and index and whatever else) to store a known value as something that can be looked up. That way, if you have a million users and want to change the age bracket of ’25-30′ to ’25 – 30′ you only update one row instead of a million. It also mitigates the risk of two different queries writing incompatible versions (like ’25-30′ or ’25 -30′ or having some dumb php problem that evaluates that as an expression and makes it equal -5.

c) key and value combinations are whats called an attribute table. This is the core of all that NOSQL stuff thats so popular today: storing a document as a series of key-value pairs. Thats what json/yaml basically are as well. XML, too, but its clunky to work with. Regardless, you need another table that just lists key and value pairs that is independent of the user and their personal data.

d) Some of these should be first class columns [what you call meta_data=0]. Attributes may seem appealing as an easy way to extend your model, but it comes at a cost. Even in a well tuned RDBMS, you have the cost of a join to the attribute table as well as iterating through the properties until you have a match. On the other hand, adding a required column to the primary (user) table means that you can retrieve that data directly when looking up a user without the cost of a join and a collection iteration (if you search every property and don’t find what you are looking for, then your access of that value is linear time vs. the number of attributes per user) .

e) Use innoDb and not MyISAM as a table type. InnoDB supports row locking and is much more transactionally safe, which means if things go wrong your database won’t get _AS_ fucked up. Also, with multiple writes to a single table, InnoDB will handle it better, since MyISAM locks the table instead of the row.

f) Use parameters in code. Don’t build strings of sql statements, or write a sanitize() method to try to strip out special characters. Prepare statement ($stmt = SELECT * from User where id=?) and call it (excute($stmt, 27)). It speeds up your queries and helps prevent sql injection attacks.

g) Always use a transaction: begin; ; if(everythings_ok) commit; else rollback; . Always. Always. Always. Any time you modify a prod DB with raw sql. Always. Always.

Third: Data is your key asset. Your database is what truly makes your application valuable. While it may seem like some extra work to properly organize and design your tables, it will save you untold hours with having easier sql queries, reliable data, and a data model that can be queried in ways other than your core application. Doing a proper data model means that you can write new programs to work with the data and handle complex queries to analyze what you have collected as hard data to improve your app. If you write a simple, robust application to acquire data, there is no end to the number of clients you can write to work with it.

It just has to be accessible and logical.

Finally: Script your database test setups. You don’t actually write to the database with raw sql, so why should your test data get that way. Ideally, you would have modular functions and a test harness that can import your data directly through the functions that your application uses. For example, if you have a php script called register_account.php, that takes a bunch of parameters to register an account, then make that script only handle parsing parameters, passing them to a function that handles the actual transaction, and then shows the page. PHP doesn’t really encourage this kind of programming, unfortunately.

So: register_account parses parameters, and then calls db_layer.php#write_user(%userdata), which actually does the database write. If you set it up this way, you can make a script called create_user_data.php, that just calls the same method with a bunch of mock objects and sets up the DB. Its better to use a DB like sqlite because it is quick and easy to work with the schema before you commit to a full on RDBMS.

That being said, ill try to scrap together a quick model to show you what I am talking about.

(see part 2 tomorrow).

May 25, 2010

Daily #141: Marble Hornets.

Filed under: culture,daily — scrame @ 8:46 pm

In taking a break from the more technological crap, I figured I’d just do some good old fashioned nonsense, or maybe something not relating to programming, since it seems like I do that a lot during the day (even if i am actually trying to do something different than the day job).

Marble hornets is something that I stumbled on a few times recently, but never really took the plunge. It was definitely worth it. You may as well just start here:

http://marblehornets.wikidot.com/the-entries

Scroll to the bottom of the page and just go through the entries in suggested viewing order. Took me a while to find that they did that at the list, and I just started with the main entries, not realizing that there were other things that I was missing. Don’t bother reading too much of the synopsis until you get a little ways into it, since the summaries jump around, and until you have a bit of a backing it would just be incoherent.

Doing some digging around, its pretty impressive. The legend grew out of a photoshop contest here:

http://forums.somethingawful.com/showthread.php?threadid=3150591&userid=0&perpage=40&pagenumber=3#post361861415

It continues like that for a while. These guys just wanted to make a video controibution, but it ended up getting out of hand and taking 6-8 months to actually complete.

There are issues, of course: the acting is bad, the plot can be a little meandering and incoherent. But, in the end, they spend less than $500 putting the whole series together, and managed to do some really good work building an atmosphere, and a few clever camera tricks to make things a bit creepier.

Anyway, its well worth your time to troll through that pile of youtube videos. I find even knowing its origins and fictitiousness that it still manages to be quite creepy and credible.

Here’s the intro. You can find the rest of the entries in the link at the top.

May 24, 2010

Daily #140: Finally got it working!

Filed under: Technology,daily — scrame @ 4:38 pm

Finally got a basic http server, as well as a client passing an input and output message across an http post without any extra parameters. The code is on github here:

http://github.com/scrame/python-protobuf-http-demo

There might be a couple more technical days here, but I want to get back to better work soon.

May 23, 2010

Daily #139: Overriding simple http request.

Filed under: Technology,daily — scrame @ 9:41 pm

Still hacking away at this, I’m kind of surprised i can’t find a better implementation or tutorial for getting a simple server off the ground, though the one liner is quite handy. The actual method is just a do_GET, but I can’t find the spec for what it should be writing if its not going to be a file (I suppose i could stream the objects).

Anyway, pretty ill today, so I’m not going to burn too much energy on it. When i get a working example ill push it up to github or something.

May 22, 2010

Daily #138: More python-protobuf

Filed under: Technology,Uncategorized,daily — scrame @ 7:42 pm

[for some reason this didn't publish on time.]

Did some work through making the messages work. I forgot that within the package that there is a second step you have to do to get bindings for the language you are working with. This is not the compilation step because that is already its own issue. For java, they have a maven project, but if you are awesome like me, you are better than that, so you can just actually manually compile and jar it up, along with a separate source jar. Just like tar, without, yknow, the gzip options. Spent some time trying to come up with a basic request handler that can handle binary on both sides. The server itself is a bit of a time sink, because I am ultimately just going to need the client.

Older Posts »

Powered by WordPress