Archive for the ‘PHP’ Category

Improving on strip_tags

The Problem

PHP’s strip_tags() method will strip away tags but makes no attempt to introduce whitespace to separate content in adjacent tags. This is an issue with arbitrary HTML as adjacent block-level elements may not have any intermediate whitespace and simply stripping away the tags will incorrectly concatenate the textual content in the 2 elements.

For example, running strip_tags() on the following:

<div>the quick brown fox</div><div>jumped over the moon</div>

… will return:

the quick brown foxjumped over the moon

This is technically correct (we’re stripped away the <div> tags) but having no whitespace between “fox” and “jumped” means we’ve transformed the content such that we’ve lost semantic and presentational details.

The Solution

There’s 2 ways I can see to fix this behavior:

  • Pre-process the HTML content to ensure or introduce whitespace between block-level elements
  • Don’t use strip_tags() and utilize a method that better understands the need for spacing between elements

I’ll focus on the latter because that’s the avenue I went down and I didn’t consider pre-processing at the time.

Pulling together a quick-and-dirty parser, I wrote the following. It’s worth noting that still still doesn’t really consider what the tags are (e.g. whether they’re inline or block) but allows the caller to specify a string ($tagContentSeparator), typically some whitespace, that is inserted between the stripped away tags:

<?php class HTMLToPlainText { const STATE_READING_CONTENT = 1; const STATE_READING_TAG_NAME = 2; static public function convert(string $input, string $tagContentSeparator = " "): string { // the input string as UTF-32 $fixedWidthString = iconv('UTF-8', 'UTF-32', $input); // string within tags that we've found $foundContentStrings = []; // buffer for current content being read $currentContentString = ""; // flag to indicate how we should interpret what we're reading from $fixedWidthString // .. this is initially set to STATE_READING_CONTENT, as we assume we're reading content from the start, even // if we haven't encountered a tag (e.g. string that doesn't contain tags) $parserState = self::STATE_READING_CONTENT; // method to add a non-empty string to $foundContentStrings and reset $currentContentString $commitCurrentContentString = function() use (&$currentContentString, &$foundContentStrings) { if(strlen($currentContentString) > 0) { $foundContentStrings[] = trim($currentContentString); $currentContentString = ""; } }; // iterate through characters in $fixedWidthString // checking for tokens indicating if we're within a tag or within content for($i=0; $i<strlen($fixedWidthString); $i+=4) { // convert back to UTF-8 to simplify character/token checking $ch = iconv('UTF-32', 'UTF-8', substr($fixedWidthString, $i, 4)); if($ch === '<') { $parserState = self::STATE_READING_TAG_NAME; $commitCurrentContentString(); continue; } if($ch === '>') { $parserState = self::STATE_READING_CONTENT; continue; } if($parserState === self::STATE_READING_CONTENT) { $currentContentString .= $ch; continue; } } $commitCurrentContentString(); return implode($tagContentSeparator, $foundContentStrings); } }

Note that the to/from UTF-8 ↔ UTF-32 isn’t really necessary, I initially did the conversion as I was worried about splitting a multibyte character, but this isn’t possible given how the function reads the input string.

Now if we take the following HTML snippet:

<div>the quick brown fox</div><div>jumped over the moon</div>

… rendered in a browser, we get:

… with strip_tags() we get:

the quick brown foxjumped over the moon

… and with HTMLToPlainText::convert() (passing in “\n” for $tagContentSeparator), we get:

the quick brown fox jumped over the moon

The latter results in text that is semantically correct, as words in different blocks aren’t incorrectly joined. Presentationally we also get a more correct conversion but, the method isn’t really doing anything fancy here, this is due to the calling knowing a bit about the HTML snippet, how a browser would render it, and passing passing in “\n” for $tagContentSeparator.

Limitations / future work

The improvement here is that textual content is pretty preserved when doing a conversion, i.e. we don’t have to worry about textual elements being incorrectly concatenated. However, what I wrote is still lacking in 2 keys areas:

  • Generally, in terms of presentation, an arbitrary bit of HTML won’t map to what a user sees in a browser. To a certain degree this is an intractable problem, as presentation is based on browser defaults, CSS styles, etc. Also, there are things that simply don’t have a standard representation in plain-text (e.g. bold text, list items, etc.). However, there are cases where sensible defaults might make sense, e.g. stripping away <span> tags but putting newline between <p> tags.
  • Whitespace is trimmed from content within tags. This may or may not matter depending on application. In my case, I cared about the words and additional whitespace just added bloat even if it was more accurate to what was in the HTML.

PHP count() is O(1)

I was curious about the performance of PHP’s count() function a while back and whether it was worth it to store the result in a variable for repeated use. I discovered the following from this answer by FractalizeR on Stack Overflow:

PHP_FUNCTION(count) calls php_count_recursive(), which in turn calls zend_hash_num_elements() for non-recursive array, which is implemented this way:

ZEND_API int zend_hash_num_elements(const HashTable *ht)
{
IS_CONSISTENT(ht);

return ht->nNumOfElements;
}

So you can see, it’s O(1) for $mode = COUNT_NORMAL.

IMAP Pickup

An interesting little project I wanted to work on; I wanted to be able to pull attachments from emails in an IMAP mailbox and then download them. I wanted an IMAP solution instead of writing a script for the MTA as a script would be specific to the MTA software and not transferable to another server. In addition, there’s also the common case where you may simply not have access to the MTA.

The biggest help in putting this together and dealing with attachments was this blog post and this comment on the PHP docs. Information on doing this is a bit scattered and incomplete in many cases, likely because extracting attachments is somewhat difficult as email is a notoriously bad way to transfer files; the file data is base64 encoded and dumped in as part of the message body.

ImapPickup is the class that encapsulates all the necessary functionality,

class ImapPickup
{
protected $imapStream = null;

protected function findAttachments($part)
{
$partNum = -1;
$attachments = array();

$this->findAttachmentsRec($part, &$attachments, &$partNum, -1);

return $attachments;
}

protected function findAttachmentsRec($part, &$attachments, &$partNum, $partNumSub)
{
if (isset($part->parts))
{
foreach ($part->parts as $partOfPart)
{
$this->findAttachmentsRec($partOfPart, &$attachments, &$partNum, $partNumSub+1);
}
}
else
{

if (isset($part->disposition)){
if ($part->disposition == 'attachment') {
$attachments[] = array($part->dparameters[0]->value, $partNum, $partNumSub);
}
}
}

$partNum++;
}

public function getAttachmentContent($msgNum, $partNum)
{
$contents = imap_fetchbody($this->imapStream, $msgNum, $partNum, FT_UID);
return imap_base64($contents);
}

public function getAttachments($msgNum)
{
$struct = imap_fetchstructure($this->imapStream,$msgNum,FT_UID);
$attachments = $this->findAttachments($struct);

return $attachments;
}

public function getAttachmentsFromMessages($msgArray)
{
$msgIdToAttachmentsMap = array();

if ($msgArray)
{
foreach($msgArray as $msgId)
{
$attachments = $this->getAttachments($msgId);
if(!empty($attachments))
{
$msgIdToAttachmentsMap[$msgId] = $attachments;
}
}
}

return $msgIdToAttachmentsMap;
}

public function getMessages($searchQuery)
{
return imap_search($this->imapStream, $searchQuery, SE_UID);
}

public function connect($mailbox, $user, $password)
{
$this->imapStream = imap_open($mailbox, $user, $password);
}

public function disconnect()
{
imap_close(
$this->imapStream);
}

}

Here’s a little example of how it can be used. This will query all messages with “pickup::” in the subject line and print out the messageID of all messages with attachments, followed by the filenames of all attachments for that message.

$imapPickup = new ImapPickup();
$imapPickup->connect("{mail.hotspotdot.net:143}INBOX", "test@test.net", "pass123");

$messages = $imapPickup->getMessages("SUBJECT pickup::");
$attachments = $imapPickup->getAttachmentsFromMessages($messages);

foreach($attachments as $msgId => $attArr)
{
echo "<p>{$msgId} => ";

foreach($attArr as $attachment)
{
echo $attachment[0];
echo ",";
}

echo "</p>";
}

$imapPickup->disconnect();

The array for a single file attachment contains 3 entries:

  • [0] => filename
  • [1] => major part number
  • [2] => minor part number

getAttachments(), findAttachments(), and findAttachmentsRec() will return an array of such entries (or an empty array is there are no attachments). getAttachmentsFromMessages() will return a map from messageID => array of single attachments.

The part number (both major and minor) is needed to retrieve the contents of an attachment. For getAttachmentContent(), simply use the major number if the minor number is <= 0, or concatenate them with a period separating them (e.g. "2.3").

PHP session_start() “Node no longer exists”

I stumbled upon this error earlier today as I attempted to store the value of a SimpleXMLElement as a session variable. I was able to narrow down the issue thanks to this post on bytes.com.

According to a user post on the PHP site, this occurs because SimpleXML returns a reference to an object containing the node value, and you can’t store a reference as a session variable. The value must be dereferenced and copied which can be done by casting.

// Bad! $storageBoxSize = $xml->data->storage_box_size;

// Good! $storageBoxSize = (int)$xml->data->storage_box_size;

I find myself hating loosely-typed languages more and more.

PHP Array of MIME Types

Useful for when you need to set HTTP headers when serving file downloads. The mime_content_type function is deprecated and in my case was just returning an empty string. The recommended alternative, using PECL finfo_file, should work fine, but adding and compiling in a PECL extension just for this seems like overkill, especially as you have more control using an array. I’m also never crazy about adding dependencies unless they’re absolutely necessary.

This is from snipplr, but includes the image/png type which was, curiously, missing.

$mime_types = array(
"323" => "text/h323",
"acx" => "application/internet-property-stream",
"ai" => "application/postscript",
"aif" => "audio/x-aiff",
"aifc" => "audio/x-aiff",
"aiff" => "audio/x-aiff",
"asf" => "video/x-ms-asf",
"asr" => "video/x-ms-asf",
"asx" => "video/x-ms-asf",
"au" => "audio/basic",
"avi" => "video/x-msvideo",
"axs" => "application/olescript",
"bas" => "text/plain",
"bcpio" => "application/x-bcpio",
"bin" => "application/octet-stream",
"bmp" => "image/bmp",
"c" => "text/plain",
"cat" => "application/vnd.ms-pkiseccat",
"cdf" => "application/x-cdf",
"cer" => "application/x-x509-ca-cert",
"class" => "application/octet-stream",
"clp" => "application/x-msclip",
"cmx" => "image/x-cmx",
"cod" => "image/cis-cod",
"cpio" => "application/x-cpio",
"crd" => "application/x-mscardfile",
"crl" => "application/pkix-crl",
"crt" => "application/x-x509-ca-cert",
"csh" => "application/x-csh",
"css" => "text/css",
"dcr" => "application/x-director",
"der" => "application/x-x509-ca-cert",
"dir" => "application/x-director",
"dll" => "application/x-msdownload",
"dms" => "application/octet-stream",
"doc" => "application/msword",
"dot" => "application/msword",
"dvi" => "application/x-dvi",
"dxr" => "application/x-director",
"eps" => "application/postscript",
"etx" => "text/x-setext",
"evy" => "application/envoy",
"exe" => "application/octet-stream",
"fif" => "application/fractals",
"flr" => "x-world/x-vrml",
"gif" => "image/gif",
"gtar" => "application/x-gtar",
"gz" => "application/x-gzip",
"h" => "text/plain",
"hdf" => "application/x-hdf",
"hlp" => "application/winhlp",
"hqx" => "application/mac-binhex40",
"hta" => "application/hta",
"htc" => "text/x-component",
"htm" => "text/html",
"html" => "text/html",
"htt" => "text/webviewhtml",
"ico" => "image/x-icon",
"ief" => "image/ief",
"iii" => "application/x-iphone",
"ins" => "application/x-internet-signup",
"isp" => "application/x-internet-signup",
"jfif" => "image/pipeg",
"jpe" => "image/jpeg",
"jpeg" => "image/jpeg",
"jpg" => "image/jpeg",
"js" => "application/x-javascript",
"latex" => "application/x-latex",
"lha" => "application/octet-stream",
"lsf" => "video/x-la-asf",
"lsx" => "video/x-la-asf",
"lzh" => "application/octet-stream",
"m13" => "application/x-msmediaview",
"m14" => "application/x-msmediaview",
"m3u" => "audio/x-mpegurl",
"man" => "application/x-troff-man",
"mdb" => "application/x-msaccess",
"me" => "application/x-troff-me",
"mht" => "message/rfc822",
"mhtml" => "message/rfc822",
"mid" => "audio/mid",
"mny" => "application/x-msmoney",
"mov" => "video/quicktime",
"movie" => "video/x-sgi-movie",
"mp2" => "video/mpeg",
"mp3" => "audio/mpeg",
"mpa" => "video/mpeg",
"mpe" => "video/mpeg",
"mpeg" => "video/mpeg",
"mpg" => "video/mpeg",
"mpp" => "application/vnd.ms-project",
"mpv2" => "video/mpeg",
"ms" => "application/x-troff-ms",
"mvb" => "application/x-msmediaview",
"nws" => "message/rfc822",
"oda" => "application/oda",
"p10" => "application/pkcs10",
"p12" => "application/x-pkcs12",
"p7b" => "application/x-pkcs7-certificates",
"p7c" => "application/x-pkcs7-mime",
"p7m" => "application/x-pkcs7-mime",
"p7r" => "application/x-pkcs7-certreqresp",
"p7s" => "application/x-pkcs7-signature",
"pbm" => "image/x-portable-bitmap",
"pdf" => "application/pdf",
"pfx" => "application/x-pkcs12",
"pgm" => "image/x-portable-graymap",
"pko" => "application/ynd.ms-pkipko",
"pma" => "application/x-perfmon",
"pmc" => "application/x-perfmon",
"pml" => "application/x-perfmon",
"pmr" => "application/x-perfmon",
"pmw" => "application/x-perfmon",
"png" => "image/png",
"pnm" => "image/x-portable-anymap",
"pot" => "application/vnd.ms-powerpoint",
"ppm" => "image/x-portable-pixmap",
"pps" => "application/vnd.ms-powerpoint",
"ppt" => "application/vnd.ms-powerpoint",
"prf" => "application/pics-rules",
"ps" => "application/postscript",
"pub" => "application/x-mspublisher",
"qt" => "video/quicktime",
"ra" => "audio/x-pn-realaudio",
"ram" => "audio/x-pn-realaudio",
"ras" => "image/x-cmu-raster",
"rgb" => "image/x-rgb",
"rmi" => "audio/mid",
"roff" => "application/x-troff",
"rtf" => "application/rtf",
"rtx" => "text/richtext",
"scd" => "application/x-msschedule",
"sct" => "text/scriptlet",
"setpay" => "application/set-payment-initiation",
"setreg" => "application/set-registration-initiation",
"sh" => "application/x-sh",
"shar" => "application/x-shar",
"sit" => "application/x-stuffit",
"snd" => "audio/basic",
"spc" => "application/x-pkcs7-certificates",
"spl" => "application/futuresplash",
"src" => "application/x-wais-source",
"sst" => "application/vnd.ms-pkicertstore",
"stl" => "application/vnd.ms-pkistl",
"stm" => "text/html",
"svg" => "image/svg+xml",
"sv4cpio" => "application/x-sv4cpio",
"sv4crc" => "application/x-sv4crc",
"t" => "application/x-troff",
"tar" => "application/x-tar",
"tcl" => "application/x-tcl",
"tex" => "application/x-tex",
"texi" => "application/x-texinfo",
"texinfo" => "application/x-texinfo",
"tgz" => "application/x-compressed",
"tif" => "image/tiff",
"tiff" => "image/tiff",
"tr" => "application/x-troff",
"trm" => "application/x-msterminal",
"tsv" => "text/tab-separated-values",
"txt" => "text/plain",
"uls" => "text/iuls",
"ustar" => "application/x-ustar",
"vcf" => "text/x-vcard",
"vrml" => "x-world/x-vrml",
"wav" => "audio/x-wav",
"wcm" => "application/vnd.ms-works",
"wdb" => "application/vnd.ms-works",
"wks" => "application/vnd.ms-works",
"wmf" => "application/x-msmetafile",
"wps" => "application/vnd.ms-works",
"wri" => "application/x-mswrite",
"wrl" => "x-world/x-vrml",
"wrz" => "x-world/x-vrml",
"xaf" => "x-world/x-vrml",
"xbm" => "image/x-xbitmap",
"xla" => "application/vnd.ms-excel",
"xlc" => "application/vnd.ms-excel",
"xlm" => "application/vnd.ms-excel",
"xls" => "application/vnd.ms-excel",
"xlt" => "application/vnd.ms-excel",
"xlw" => "application/vnd.ms-excel",
"xof" => "x-world/x-vrml",
"xpm" => "image/x-xpixmap",
"xwd" => "image/x-xwindowdump",
"z" => "application/x-compress",
"zip" => "application/zip" );

Altering the file size limit for PHP uploads

Well, of course, you can edit php.ini to change the file size limit, but for cheap (typically, shared) hosting solutions the issue that pops us is that you don’t have access to it.

I was researching how I could do this w/o modifying php.ini and discovered a forum post that explain a few methods of how it could be done.

Method A (the one that worked for me):
Edit your .htaccess file (if you don’t have one in your web site’s public root directory, you should be able to just create one) and add the following:
php_value post_max_size 8M
php_value upload_max_filesize 8M
(of course, replace 8M with however many megabytes you wish to limit the file sizes of uploads by)

Method B (the one that didn’t work for me):
Use the ini_set function,
ini_set('post_max_size', '8M');
ini_set('upload_max_filesize', '8M');

Method C (the one I didn’t try):
Use a custom php.ini. Not sure exactly what would be involved here, but you’d likely have to programmatically make a copy of the existing php.ini file, modify the copy, and then place the copy in the directory of the script that needs it or in your web site’s public root directory.