Forum One on PHP
Subversion Checklist
We recently rolled out a new version control process, which we feel will better support our increasing project complexity.
Previously, our live sites used trunk, and our work took place on a single dev branch. This was problematic because it was sometimes difficult to separate the tested-and-ready-for-live code from the code still in development.
Now, we use multiple branches (one for each major enhancement and/or batch of bugfixes), and tag any finished changes. Tagging takes the guesswork out of figuring out what code is ready for live. When we need to cutover some changes, it's now as simple as switching to the newest tag.
Initial Buildout- Checkout a copy of trunk into your local working directory:
- svn checkout svn+ssh://path/to/project/trunk project
- Develop. Commit often. When the functionality is complete, create a branch:
- svn ci -m "Committing trunk changes"
- svn copy svn+ssh://path/to/project/trunk svn+ssh://path/to/project/branches/1-0-x
- Switch to the new branch:
- svn switch svn+ssh://path/to/project/branches/1-0-x
- Test. When finished, tag your new release:
- svn copy svn+ssh://path/to/project/branches/1-0-x svn+ssh://path/to/project/tags/1-0-0
- Merge the branch back into trunk:
- svn switch svn+ssh://path/to/project/trunk
- svn merge ^/trunk@HEAD ^/branches/1-0-x@HEAD
- Resolve any conflicts. Merging can sometimes be a tricky and time-consuming process. When in doubt, ask for help. There are also several GUI tools to make merging easier. Beyond Compare 3, WinMerge, and TortoiseSVN are some choices for Windows.
- Commit the merge
- svn commit -m "Merged branch 1-0-x back into trunk"
- Switch to the latest branch number:
- svn switch http://path/to/project/branches/1-3-x
- Develop, then test. When finished, tag your release with the next minor version:
- svn copy svn+ssh://path/to/project/branches/1-3-x svn+ssh://path/to/project/tags/1-3-2
- Repeat steps 5-6 of Initial Buildout.
- For major, long-running development that needs to happen simultaneously with smaller feature development, consider adding a new feature branch. Otherwise, work on trunk.
- Repeat steps 1-6 of Initial Buildout, remembering to update major version numbers (e.g. branches/1-1-x).
- As a good practice, make sure to do svn update before doing any commits.
- Switch to the latest tag:
- svn switch svn+ssh://path/to/project/tags/1-1-0
Speed Up Your MySQL Queries Using MD5 and Indexing
So I created a poorly constructed 'SELECT' query in a stored procedure and I was up late last night trying to figure out how I could best optimize it. After sitting in my 94 degree basement for a few hours, I thought of using hashed values via CONCAT and MD5 functions to create an index of multiple columns to improve 'SELECT' queries using 'JOINS.' The performance increase was instant after I made the necessary adjustments. To save others the time and headache, I'm going to share with you my solution and the resulting performance improvements.
So, I am dealing with two tables called "meetup_events" and "location" containing 1,000+ and 290,000+ records, respectively. Both tables contain information for:
- location name
- location address
- location city
- location state
- location zip
- location latitude
- location longitude
- location phone
location table Field Type Null Key Default Extra lid int(10) unsigned NO PRI NULL auto_increment name varchar(255) NO street varchar(255) NO additional varchar(255) NO city varchar(255) NO province varchar(16) NO postal_code varchar(16) NO country char(2) NO latitude decimal(10,6) NO 0.000000 longitude decimal(10,6) NO 0.000000 source tinyint(4) NO 0 is_primary tinyint(4) NO 0
The data fields that I need to extract include locations from the meetup_events table that are not in the location table. The query that I was using originally is:
SELECT DISTINCT
`me`.`venue_name`,`me`.`venue_address`, `me`.`venue_city`,`me`.`venue_state`,
`me`.`venue_zip`,`me`.`venue_phone`, `me`.`venue_lon`,`me`.`venue_lat`
FROM `meetup_events` `me` LEFT JOIN `location` `l`
ON( `me`.`venue_address` = `l`.`street`
AND `me`.`venue_state` = `l`.`province`
AND `me`.`venue_city` = `l`.`city`)
WHERE `me`.`venue_lat` <> 0 AND isnull(`l`.`name`)
Unfortunately, there is no immediate way of identifying a relationship between the location associated with each meetup_event and a record in the location table that uses an index because the meetup_events table contains data that is imported from a third-party source. This is where I took a wrong turn and JOIN'ed the tables using the address, state, and city fields.
...
ON( `me`.`venue_address` = `l`.`street`
AND `me`.`venue_state` = `l`.`province`
AND `me`.`venue_city` = `l`.`city`)
...
This didn't pose an issue when the number of records in the location table was smaller, but as the table grew it failed to scale and resulted in ERROR: query execution was interrupted. Not fun!
So after thinking about the minimal data that I would need to constuct a unique identifer for a given location, I decided to turn my attention to the location/venue name, latitude, and longitude data. This will take care of potential duplicates such as locations with mutliple venues like malls, multi-storied buildings, etc.
Now that I have figured out that aspect, I must store the data in a format that is optimial for the MySQL to quickly read and index. Without thought, I turned to the CONCAT() function to provide a single field to store the data and return a scalar value to store.
CONCAT(name, latitude, longitude);
To ensure that the data remains relatively short and unique, the MD5() function comes into play which will result in a hashed value of the concatenated value returned by CONCAT(name, latitude, longitude).
MD5(CONCAT(name, latitude, longitude));
Example: SELECT MD5(CONCAT('The Park at 14th', 38.879456, -76.985059));
e3dd1caa1253f2f5def9caadc3e474b7
Using this means of generating a unique value, both tables (meetup_events and location) are given a new field called, hashed_index of type varchar(100) with an INDEX.
meetup_events table (with new field) Field Type Null Key Default Extra id int(11) NO PRI NULL name varchar(100) NO NULL description text NO NULL time timestamp NO CURRENT_TIMESTAMP on update CURRENT_TIMESTAMP group_id varchar(100) NO NULL group_name varchar(100) NO NULL group_photo_url varchar(200) NO NULL fee varchar(100) NO NULL how_to_find text NO NULL organizer_name varchar(100) NO NULL nid int(11) NO NULL status int(11) NO 0 venue_name varchar(100) NO NULL venue_address varchar(100) NO NULL venue_city varchar(100) NO NULL venue_state char(5) NO NULL venue_zip char(5) NO NULL venue_phone varchar(15) NO NULL venue_lon decimal(10,6) NO NULL venue_lat decimal(10,6) NO NULL hashed_index varchar(50) NO MUL NULLlocation table (with new field) Field Type Null Key Default Extra lid int(10) unsigned NO PRI NULL auto_increment name varchar(255) NO street varchar(255) NO additional varchar(255) NO city varchar(255) NO province varchar(16) NO postal_code varchar(16) NO country char(2) NO latitude decimal(10,6) NO 0.000000 longitude decimal(10,6) NO 0.000000 source tinyint(4) NO 0 is_primary tinyint(4) NO 0 hashed_index varchar(50) NO MUL NULL
Here is the query rewritten and optimized:
SELECT DISTINCT
`me`.`venue_name`,`me`.`venue_address`,
`me`.`venue_city`,`me`.`venue_state`,
`me`.`venue_zip`,`me`.`venue_phone`,
`me`.`venue_lon`,`me`.`venue_lat`
FROM `meetup_events` `me` LEFT JOIN `location` `l`
USING( hashed_index)
WHERE `me`.`venue_lat` <> 0 AND isnull(`l`.`name`);
If you have any multiple column indexes within any of your tables, try indexing a hashed version of those fields concatenated, and then test to see if this results in a performance increase. Good luck!
WordPress, Thesis, and GPL Heartache
Several battles have been brewing from within the WordPress community.
It's WordPress, not Wordpress dangit!
A couple weeks ago, a snippet of code got sneaked into WordPress 3.0. The function, capital_P_dangit, is a filter that replaces "Wordpress" with "WordPress" throughout one's site.
While this doesn't seem like a big deal, this change infuriated some users. They claimed that WordPress does not own a user's content, and therefore, such a move (without first explicitly asking to do so) infringes on the user's expressive freedoms.
The filter itself doesn't bother me personally. What bothers me is the lack of review. This change produced a rarely-occuring permalink bug in WordPress 3.0. This "feature" was added with barely any discussion, and hopefully isn't an indication of things to come.
Themes and Plugins Must be GPL
WordPress is licensed under the GPL (v2). It's been clearly stated that all WordPress plugins and themes should be licensed under the GPL as well.
As written on Perpetual Beta, "Anyone who gets a copy of a premium theme then has the right to freely distribute it or modify it virtually without restriction (expect, of course, those restrictions found in the GPL itself)".
The popular paid theme, Thesis, does not adhere to the GPL license. The authors refuse to.
The GPL (version 2)
The GPL ensures that software is freely redistributable, but not necessarily without a price. You've probably heard the phrase "free as in free speech, not as in free beer." GPL software is meant to protect its users (and the community), not the authors. Users have four essential freedoms:
- The freedom to run the program, for any purpose (freedom 0).
- The freedom to study how the program works, and change it to make it do what you wish (freedom 1). Access to the source code is a precondition for this.
- The freedom to redistribute copies so you can help your neighbor (freedom 2).
- The freedom to distribute copies of your modified versions to others (freedom 3). By doing this you can give the whole community a chance to benefit from your changes. Access to the source code is a precondition for this.
Thesis is Special, Right?
The authors, particularly Chris Pearson, fear that switching to GPL will de-value their product. It'll allow for knock-off themes that are strikingly similar to Thesis — but for a fraction of the cost.
Under the GPL license, free distribution is encouraged -- as long as the product (or derivative) remains under the GPL.
The other argument is that although Thesis requires WordPress to function, the vast majority of the theme's code is totally original. In that respect, some argue that it may not fall into the "derivative work" category, which mandates the GPL stamp.
Chris claimed during a live call that WordPress cannot lawfully force Thesis to adopt the GPL. He coined the GPL as a "flimsy and unenforceable license".
No, says WordPress
The creator of WordPress, Matt Mullenweg, argued that licensing Thesis under the GPL was "the right thing to do." He argued that commercial GPL themes (and plugins) can still be profited from, and there are other effective models at earning a steady income (such as through offering services and/or paid support).
Most notably, those supporting Matt's stance argue, "If you don't agree with the GPL, then don't develop for WordPress."
Although both sides have decent points, I was a little disappointed by all the smearing via Twitter. At one point, Matt essentially told followers not to use Thesis, and (in response to a Thesis vulnurability claim), "This is what happens when non-coders think they can code." This isn't saying that Chris wasn't without flaws, as his arrogance shined during the live call: "I am one of the three most important people in WordPress."
Conclusion
This is an ongoing battle, with both sides clinging to their interpretations of the GPL license. I don't see either side giving up very easily, so this dispute could likely find its way into court. I'm excited to see how this unfolds, as a decision will affect not just WordPress, but many other open platforms (Drupal, Joomla, etc.) as well.
What are your thoughts?
Life with Quercus: PHP Via Java
There has been a lot of (virtual) ink spilled over HipHop, Facebook's new PHP-to-C compiler; but a similar approach has flown a little more under the radar. I'm talking about Quercus, a PHP interpreter/compiler written in Java by Caucho, the makers of the Resin app server.
Quercus is a servlet that can be deployed in any J2SE/J2EE servlet container — Resin, Tomcat, JBoss, WebSphere, etc. — and can run PHP applications deployed in the same webapp. Of course the question is why would you use Quercus when you can use Zend's PHP interpreter?
Unlike HipHop where the biggest reason to use it is performance, the case for Quercus is slightly more nuanced. Quercus has more capabilities than HipHop out of the box; for instance you can drop Drupal, WordPress, Joomla, MediaWiki, Gallery2 and more in Quercus and they run. Need to use eval (I'm looking at you, Drupal)? No problem.
Similarly, unlike HipHop which contains a built-in web server, Quercus is a servlet which provides you more options on how you run it. And because we're running on Java we have access to all the tasty goodies that people have written for it. Need asynchronous messaging? Love JMS? Multi-node caching? Terracotta and Ehcache rocks — interestingly enough someone already did this with Drupal. The list goes on and on...Apache Axis is great for web services, GraniteDS does almost as much as Adobe LifeCycle to connect with Flex.
When you look at the documentation for Quercus, they talk about performance being at least comparable to Apache/PHP, and they're mostly correct. Deploying a fairly intensive community Drupal site (OG, Spaces, lots of Views) on Resin 4 and testing with Apache Bench (AB) on a workstation puts Resin running only slightly slower than Apache, at least up to 50 concurrent users.
There is some room to grow on the Resin side as it was running pure Java, no JNI, and anecdotal testing indicates that this provides some 10-12 percent performance benefit. So out of the box this would put Resin processing requests at about the same speed as Apache.
We were testing the community version of Resin and Quercus, the Professional version has the ability to compile PHP to Java using the same sort of static analysis techniques used by HipHop and, at least according to benchmarks published by Caucho (PDF), show an increase in performace of Drupal, WordPress, and other common PHP applications, ranging from ~17 percent in phpBB to almost 110 percent for WordPress.
So, what's not to love, right? We can take existing PHP applications, move them over painlessly and get both increased performance and capability.
Sadly, things are not that clear-cut. Going back to the example below of the 50 concurrent connections. While performance was roughly equivalent, the effect on system resources was not, Resin used considerably more memory than the comparable test on Apache. Another issue is lack of sophisticated URL rewriting on some servlet containers. Resin supports enough of the functionality of mod_rewrite to make Drupal work, but others do not. Getting Drupal clean URLs in Tomcat, for example, is difficult. Other uses of mod_rewrite may not work at all without significant effort.
And perhaps the biggest downside for Quercus is inherent in what it is. It's Java. For organizations that have made a significant investment in PHP the code is not all there is. You have systems tuned for the LAMP stack. You have system administrators that are knowledgable and experienced in maintaining those systems. You build solutions that use PHP, and while the additional functionality of Java is nice, what do you do with it? But like HipHop, the very existance of Quercus is encouraging. While I don't have a project right now targeting Quercus, and I may never, the fact that companies like Facebook and Caucho are innovating and adding new options, new features makes this an exciting time.
Salsa Rules! A New Drupal Module for Democracy in Action
One frequent request we get from clients is to integrate their Drupal web site user profiles with a Customer Relationship Management (CRM) system so that their users' information can be used for email messaging and other purposes.
We've worked with a variety of CRMs including CiviCRM, SalesForce, Avectra netFORUM, Convio, and Democracy In Action's Salsa.
Salsa has a focus on advocacy activities including creating and signing petitions, generating letters to the editor, and fundraising. so it's a natural fit for a lot of our work with nonprofits, NGOs, and other public policy organizations.
When built EarthDay.org, we created a social networking site for users to register and commit to the cause, and at the end everything had to flow into Salsa to serve as one master list of events, petitions, and individuals for the Earth Day Network.
In order to make the process as flexible as possible, we based our work on the Drupal Rules module, adding actions to create Supporters, Events, Petitions, Campaigns and Groups from nodes in Drupal.
Below is a screencast showing some of the potential. I take you through the process of creating a new user with a Content Profile node and creating a Supporter in Salsa. Then, I create an Event in Salsa from one in Drupal and have a user register for that Event. The result is that having their Supporter record is correctly associated with the Event in Salsa.
This new module is available on drupal.org at www.drupal.org/project/salsa_rules. I hope you find it useful to your work.
MongoDB: A Simple User Directory
In my last post, I glossed over MongoDB from a very high level. While it's useful to know how MongoDB works, I didn't cover exactly how to use it.
There are a growing number of NoSQL systems, and one of MongoDB's greatest strengths is its ease-of-use. It takes a couple of minutes to install the server, and commands are given using the familiar JSON syntax.
Setting up MongoDBI'm running Windows, but installing on a Mac or *nix system shouldn't be much different. After downloading the appropriate package, copy the mongdb folder into c:/data/db/. Then, using your shell (cmd prompt), browse to c:/data/db/mongodb/bin and execute the server executable, mongod.exe:
Voila! The MongoDB server is now up and running. Let's fire up a separate window, this time for an instance of the MongoDB client. The client is mongo.exe, also in the bin directory.
Setting up a User DirectoryWe want to create a simple user directory, and want to store the following information about each user:
name age languagesUnlike with traditional SQL systems, we don't have to create a database schema. You simply create a JSON object containing the data you want, and MongoDB saves it as a document. By default, MongoDB will use the test database. Let's change that to a new database called users:
use users;I know, it was pretty intense. Once you've wiped the sweat from your brow, let's add some data to our new database:
var data = [ {name: "John Doe", age: 28, languages: ["English", "Arabic"]}, {name: "Mike Smith", age: 35, languages: ["English"]}, {name: "Sally Mae", age: 41, languages: ["Italian", "French", "Arabic"]} ];To add these people to our database, we run the following command:
db.users.save(data); Getting all usersTo find all items in the collection, you'd run db.users.find().
This returns our users. If there are more than 10 users, you'll need to enter the "it" command to iterate through the rest of the results. Fortunately, the MongoDB PHP extension makes iterating a lot less tedius.
Getting specific usersYou're able to enter specific parameters into the find() function to retrieve filtered results.
To find "John Doe":
db.users.find({name: "John Doe"});To find anyone 41 years old:
db.users.find({age: 41});To find anyone who speaks Arabic:
db.users.find({languages: "Arabic"});MongoDB also supports comparison operators (<, >, <=, >=, etc), but the format takes a bit to get used to. For example, let's find users between 30 and 40 years old:
db.users.find({age: {'$gte': 30}, {'$lte': 40}});NOTE: "Less than" is '$lt', whereas "Less than or equal" is '$lte'.
SnafusWhen using MongoDB, you'll run into difficulties that were non-issues in MySQL. One example is using "OR" when building queries. In MySQL, it's quick and painless to use OR when finding records. Let's find users <= 30 or >= 40 years old:
MySQL
SELECT * FROM users WHERE age <= 30 OR age >= 40;MongoDB
The "OR" operator was just added to a nightly build of MongoDB, but is unavailable in the current stable version. It's still possible, but it requires a custom '$where' function:
db.users.find({'$where': function() { return this.age <= 30 || this.age >= 40; }}); ConclusionMongoDB seems to be a pretty powerful tool for document storage. It's fairly robust, extremely fast (from what I can tell), and easy to use.
I cannot stress enough that MongoDB (or any NoSQL solution) is not a substitute for all things SQL. If your application has transactional or relational requirements, you're better off sticking with SQL.
MongoDB (NoSQL): An Architectural Overview
MongoDB is one of the forerunners in the NoSQL movement, an effort to promote non-relational, schema-free data stores. It lacks any table JOINs, which avoids performance bottlenecks seen with traditional SQL servers.
Why NoSQL?NoSQL servers are not meant to replace traditional SQL servers. They are meant to handle problems without heavy transactional requirements, but with the potential to massively scale if needed. They work well for providing quick access to large number of documents, serving pages on high-traffic web sites, or delivering streaming media.
Document Storage: Database > Collection > DocumentMongoDB is a document database. Documents are stored in collections, and collections are in turn stored in a database. A collection is similar to a table in MySQL (it's a named group of documents), but a collection lacks any schema.
DocumentsA document contains the actual data, and is stored as a binary JSON object (called BSON). A single document has a storage limit of 4 MB. Queries are expressed as JSON-style objects, making it pretty painless to save and retrieve data:
var data = {first_name: "John", last_name: "Doe", lottery: [42, 16, 29]};
db.users.save(data);
Files can be stored by MongoDB by using the GridFS specification. Since a document can be a maximum of 4 MB, GridFS works by splitting a large file into (usually 256k) chunks before storing it into a files collection. Unfortunately, MongoDB doesn't do any sort of automatic cleanup in the event of a processing error (your collection would be stuck with fragments of the corrupted file).
Horizontal ScalabilityMongoDB uses (auto) sharding, not replication, as a way of achieving high scalability. Sharding essentially involves "breaking your database down into smaller chunks called shards and spreading those across a number of distributed servers. With Mongo, we tend to think of replication as a way to gain reliability/failover rather than scalability. (See this article for more.)
Performance- There's a client driver per language
- CouchDB uses REST
- Documents have a maximum size of 4MB
- This is not changeable
- Memory-mapped files for data storage
- This means that data is limited to around 2GB on 32-bit systems
- MongoDB stores as much data in RAM as possible
- Update-in-place (instead of MVCC, as in CouchDB)
- Written in C++
Stay tuned for my next post, which will cover the creation of a simple web app using Mongo and the PHP driver.
