Automatic content moderation with 'validates_text_content'

Over the last six months I have been working on a number of community websites, all of which aim to make it as easy as possible for people to contribute. No account creation, no captchas, no hassle. This has brought a particular problem to my attention: one of the greatest differences between a valuable online community and a horrible one is the amount of time and thought that goes into the comments and content produced by the community members. For example, compare the comments written on Hacker News with the ones on YouTube.

A big issue for many online communities is that automatically distinguishing valuable content from trash is very hard. Most communities solve this problem by enlisting human moderators in addition to a member-accessible content flagging system.

But what if human moderation is not an option? You have a small community that you want to grow quickly but you can't afford to devote untold hours moderating content or lots of money paying someone else to moderate the content for you. Luckily some simple statistical analysis can probably come to the rescue!

How many words does it take to say something of worth? Surely content that is only 20 characters long cannot be contributing much? How about content that has no capital letters? Or no punctuation? There are many patterns inherent in any written language that we can leverage to determine whether or not textual content meets a certain level of quality. Text that has been well thought out will be laid out in sentences and paragraphs, will be well punctuated, and will consist mainly of words that are listed in a dictionary somewhere rather than short abbreviations.

With this in mind we can dig up some simple relevant facts about how the English language is structured:

Sentences start with a capital letter and end with one of a variety of punctuation marks.
The letter 'E' is the most commonly used letter in English words, occurring roughly 12% of the time. [source]
The average length of an English word is roughly 5.1 characters. [source]
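
As a rough way to check these figures against your own content, the sketch below measures both statistics for a given string. The helper names are hypothetical and are not part of the plugin; the word-splitting rule (runs of letters and apostrophes) is also just an assumption.

```ruby
# Hypothetical helpers for measuring a text sample (not part of the plugin).

# Fraction of alphabetic characters that are the letter 'e', ignoring case.
def e_frequency(text)
  letters = text.scan(/[A-Za-z]/)
  return 0.0 if letters.empty?

  letters.count { |c| c.downcase == "e" } / letters.size.to_f
end

# Mean word length, where a "word" is assumed to be a run of letters
# and apostrophes.
def average_word_length(text)
  words = text.scan(/[A-Za-z']+/)
  return 0.0 if words.empty?

  words.sum(&:length) / words.size.to_f
end
```

Running these over a sample of existing comments should give a feel for how far typical content sits from the averages quoted above.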

We can also pull in a relevant point from simple common sense etiquette:

Writing exclusively in BLOCK CAPS is a sign of either rudeness or laziness.

Without inventing strong A.I. or resorting to complex Bayesian filtering we can still come up with some simple rules that have a very good chance of identifying useful, thought-out content:

The text should start with a capital letter.
The text should contain at least one punctuation mark (exclamation point, question mark or period).
The text should contain at least one 'e' for every 30 characters (~3%).
The text should contain at least one space for every 20 characters (5%).
The text should be at least 25% lowercase.
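
The five rules above can be sketched in plain Ruby as follows. This is only an illustration under stated assumptions, not the plugin's actual code: the module and method names are hypothetical, 'e' is counted case-insensitively, each ratio is measured against the total character count after stripping surrounding whitespace, and empty text is rejected outright.

```ruby
# A minimal sketch of the five rules. The names are hypothetical and this
# is not the plugin's API or implementation.
module TextQuality
  module_function

  # Returns an array of symbols naming each rule the text fails.
  # An empty array means the text passes every rule.
  def failed_rules(text)
    text = text.to_s.strip
    length = text.length
    return [:empty] if length.zero? # assumption: empty text always fails

    failures = []
    failures << :capital_start unless text.match?(/\A[A-Z]/)
    failures << :punctuation   unless text.match?(/[.!?]/)
    # At least one 'e' per 30 characters (assumed case-insensitive).
    failures << :letter_e      if text.count("eE") * 30 < length
    # At least one space per 20 characters.
    failures << :spaces        if text.count(" ") * 20 < length
    # At least 25% of all characters are lowercase letters.
    failures << :lowercase     if text.count("a-z") * 4 < length
    failures
  end

  def acceptable?(text)
    failed_rules(text).empty?
  end
end
```

For example, `TextQuality.failed_rules("LOL")` reports every rule except the capital-letter one, while an ordinary, properly punctuated sentence passes. Returning the names of the failed rules rather than a bare boolean makes it possible to tell the author which part of their comment needs work.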

These rules are deliberately quite conservative in order to minimize the likelihood of false positives, but in practice they have proven extremely effective. This exact rule set has been in production on two of my projects for nearly six months, and in both cases I have seen many examples of people having their throw-away comments rejected, only for them to try again and come up with something that contributed far more.

To make it easier to implement this idea in my other projects I created a Rails plugin that adds a simple validation method to ActiveRecord, 'validates_text_content'.

To try it out, check out the project on GitHub:
http://github.com/aarongough/validates_text_content