PHP | semi/signal

Archive for the ‘PHP’ Category

Improving on strip_tags (part 2)

Feb 26 2023 · PHP

Whitespace and tags

Previously, I looked at improving the functionality of strip_tags such that words across tags are not mashed together. The method I derived works well enough but it’s limited in that all tags are treated the same way and all whitespace separators are the same. I wanted to see if I could improve the method a bit more to address these limitations; that is, introducing whitespace based on the type of tag encountered instead of injecting whitespace after stripping away a tag.

For example, when dealing with inline tags, whitespace should be preserved:

This bit of HTML:
<span>the quick brown fox </span><span>jumped over the moon</span>
… should produce:
the quick brown fox jumped over the moon

Alternatively, when dealing with block-level tags, a newline should be injected:

This bit of HTML:
<div>the quick brown fox</div><div>jumped over the moon</div>
… should produce:
the quick brown fox
jumped over the moon

Note that we’re simply talking about the common/expected browser behavior of what are thought of as inline-level or block-level tags. In reality, this categorization isn’t really part of the HTML standard anymore and layout behavior is determined by CSS. From MDN:

That said, when looking at arbitrary HTML content, I still think “block” vs. “inline” is a useful distinction, at least insofar as inferring default or common behavior.

The special case

The <br> tag presents a special case. While it’s classified as an inline element, <br> represents whitespace that is generally similar to that of a block-level element (e.g. a newline). In implementation this is simple to handle but does introduce a tiny bit of additional complexity.

Looking at the high-level transformations needed, we get the following:

Inline-level tags → strip away (no action needed, don’t alter any existing whitespace within tag contents)
Block-level tags → strip away, replace with newline
<br> tags → strip away, replace with newline

Code

Reworking the convert() method from the previous post, we get the following:

class HTMLToPlainText
{
    const BLOCK_LEVEL_ELEMENTS = [
        "address",
        "article",
        "aside",
        "blockquote",
        "details",
        "dialog",
        "dd",
        "div",
        "dl",
        "dt",
        "fieldset",
        "figcaption",
        "figure",
        "footer",
        "form",
        "h1",
        "h2",
        "h3",
        "h4",
        "h5",
        "h6",
        "header",
        "hgroup",
        "hr",
        "li",
        "main",
        "nav",
        "ol",
        "p",
        "pre",
        "section",
        "table",
        "ul"
    ];

    const INLINE_LEVEL_ELEMENTS_THAT_PRODUCE_NEWLINE = [
        "br",
    ];

    const STATE_READING_CONTENT = 1;
    const STATE_READING_TAG_NAME = 2;

    static public function convert(string $input, string $blockContentSeparator = "\n"): string
    {
        // the input string as UTF-32
        $fixedWidthString = iconv('UTF-8', 'UTF-32', $input);

        // string within tags that we've found
        $output = "";

        // buffer for current/last tag name read
        $currentTagName = "";
        $currentTagIsClosing = null;

        // buffer content in the current tag being read
        $contentInCurrentTag = "";

        // flag to indicate how we should interpret what we're reading from $fixedWidthString
        // .. this is initially set to STATE_READING_CONTENT, as we assume we're reading content from the start, even
        // if we haven't encountered a tag (e.g. string that doesn't contain tags)
        $parserState = self::STATE_READING_CONTENT;

        $flushCurrentToOutput = function() use (&$output, &$contentInCurrentTag, &$currentTagName, &$currentTagIsClosing, &$blockContentSeparator) {
            // $currentTagName holds everything read between "<" and ">" (minus any "/"), so keep only the
            // leading name; otherwise attributes, or the space in "<br />", would prevent a match below
            $trimmedTag = ltrim($currentTagName);
            $tagName = strtolower(substr($trimmedTag, 0, strcspn($trimmedTag, " \t\r\n")));

            // handle inline tags, which produce a newline (e.g. <br>)
            // .. note that these can be empty (<br>) or self-closing (<br/>)
            if(in_array($tagName, self::INLINE_LEVEL_ELEMENTS_THAT_PRODUCE_NEWLINE)) {
                $output .= $contentInCurrentTag . $blockContentSeparator;
            } else {
                // append $blockContentSeparator if we're at the *opening or closing* of a block-level element
                // (for inline element, leave content as-is)
                if (in_array($tagName, self::BLOCK_LEVEL_ELEMENTS)) {
                    $output .= $contentInCurrentTag . $blockContentSeparator;
                } else {
                    $output .= $contentInCurrentTag;
                }
            }

            // reset
            $contentInCurrentTag = "";
            $currentTagIsClosing = null;
            $currentTagName = "";
        };

        // iterate through characters in $fixedWidthString
        // checking for tokens indicating if we're within a tag or within content
        for($i=0; $i<strlen($fixedWidthString); $i+=4) {
            // convert back to UTF-8 to simplify character/token checking
            $ch = iconv('UTF-32', 'UTF-8', substr($fixedWidthString, $i, 4));

            if($ch === '<') {
                $flushCurrentToOutput();
                $parserState = self::STATE_READING_TAG_NAME;
                continue;
            }

            if($ch === '>') {
                $flushCurrentToOutput();
                $parserState = self::STATE_READING_CONTENT;
                continue;
            }

            if($parserState == self::STATE_READING_TAG_NAME && $ch == '/') {
                $currentTagIsClosing = true;
                continue;
            }

            if($parserState == self::STATE_READING_TAG_NAME) {
                $currentTagName .= $ch;
                continue;
            }

            if($parserState === self::STATE_READING_CONTENT) {
                $contentInCurrentTag .= $ch;
                continue;
            }
        }

        $flushCurrentToOutput();

        return trim($output, $blockContentSeparator);
    }
}

Testing

Throwing some arbitrary bits of HTML at this function seems to indicate that the method works correctly, but a method like this really calls for some form of automated testing. I could derive test cases from the function logic, which is what’s typically done when testing an arbitrary method, but that approach is biased and limited here. Biased in that I’d be looking at the function and coming up with test cases based upon my own experience (what I’ve encountered and where I think there may be potential issues). Limited in that I’d likely only come up with a handful of test cases unless I invested a significant chunk of time into compiling a comprehensive set; HTML has relatively few building blocks but, given the number of different ways those blocks can be combined and arranged, we end up with a fairly large number of permutations.

What would really be effective here is testing with a large and varied corpus of test cases, i.e. mappings of HTML snippets to plain-text representations (data-driven testing). It’s usually hard to generate or find data for such testing, but the PHP repository has a number of test cases for strip_tags() that can be leveraged:

strip_tags_basic1.phpt has some good baseline tests (HTML tags, PHP tags, tags with attributes, HTML comments, etc.)
strip_tags_basic2.phpt has a good test case (different tags + mix of block and inline elements + PHP tags) but is really testing the allowed_tags_array argument to strip_tags(), which I forgot was a thing and didn’t consider in my method

Beyond the test cases in these 2 files, there are other good cases scattered in the repo, seemingly tied to specific bugs encountered (e.g. bug #53319, which involves handling of “<br />” tags), but they can be hard to locate given the organization, or lack thereof, of the test files. In any case, it’s great having this data to work with, and some issues surfaced when I began subjecting my code to some of these tests (e.g. the content separator for block-level elements needing to be appended at the point of both the opening and closing tags, not just the closing tag).

Implementation-wise, testing is mainly a matter of encoding the test cases in a map and asserting that the actual result matches expectations:

$testCases = [
    "<html>hello</html>" => "hello",
    "<?php echo hello ?>" => "",
    "<? echo hello ?>" => "",
    "<% echo hello %>" => "",
    "<script language=\"PHP\"> echo hello </script>" => " echo hello ",
    "<html><b>hello</b><p>world</p></html>" => "hello\nworld",
    "<html><!-- COMMENT --></html>" => "",
    "<html><p>hello</p><b>world</b><a href=\"#fragment\">Other text</a></html><?php echo hello ?>" => "hello\nworldOther text",
    "<p>hello</p><p>world</p>" => "hello\n\nworld",
    '<br /><br />USD<input type="text"/><br/>CDN<br><input type="text" />' => "USD\nCDN",
];

foreach ($testCases as $html => $expectedPlainText) {
    // call the convert() method of the HTMLToPlainText class shown above
    $actualPlainText = HTMLToPlainText::convert($html);

    echo "TEST: " . $html . "\n";
    echo "EXPECTED: " . $expectedPlainText . "\n";
    echo "ACTUAL: " . $actualPlainText . "\n";
    echo "----\n";

    assert($actualPlainText === $expectedPlainText);
}

Testing is still limited here. I’d love to simply have a large batch of test cases to throw at the function, but something like that is not readily available.
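One possible refinement to the loop above, sketched on the assumption that collecting every mismatch is more useful than halting at the first failed assert(). The helper name findFailingCases() is made up for illustration, and strip_tags() is passed in as a stand-in converter only so that the snippet runs on its own:

```php
<?php

// Runs every case through $convert and returns the ones that don't match,
// rather than stopping at the first failure.
function findFailingCases(callable $convert, array $testCases): array
{
    $failures = [];

    foreach ($testCases as $html => $expected) {
        $actual = $convert($html);

        if ($actual !== $expected) {
            $failures[] = ['html' => $html, 'expected' => $expected, 'actual' => $actual];
        }
    }

    return $failures;
}

// stand-in converter: plain strip_tags(), which is expected to fail the second case
$failures = findFailingCases('strip_tags', [
    "<html>hello</html>" => "hello",
    "<p>hello</p><p>world</p>" => "hello\n\nworld",
]);

foreach ($failures as $failure) {
    echo "FAILED: " . $failure['html'] . "\n";
}
```

Swapping in the real converter is just a matter of passing a different callable as the first argument.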

Limitations / future work

The new convert() method is more robust, but there are still some key limitations when compared to the strip_tags() function:

PHP’s strip_tags() is actually a lot more robust when it comes to invalid/malformed HTML content, as the tests in strip_tags.phpt demonstrate
Preserving certain tags (as with the allowed_tags_array argument) wasn’t considered

Also, whitespace/separators produced from <br> elements at the beginning or end of the input HTML are stripped away. I don’t think this is correct, as browsers preserve whitespace from <br> elements and don’t collapse it as they do with empty block-level elements.

Tags: block-level, data-driven testing, HTML, inline-level, PHP, software testing, strip_tags, unit testing

Improving on strip_tags

Aug 13 2022 · PHP

The Problem

PHP’s strip_tags() function will strip away tags but makes no attempt to introduce whitespace to separate content in adjacent tags. This is an issue with arbitrary HTML, as adjacent block-level elements may not have any intermediate whitespace and simply stripping away the tags will incorrectly concatenate the textual content of the 2 elements.

For example, running strip_tags() on the following:

<div>the quick brown fox</div><div>jumped over the moon</div>

… will return:

the quick brown foxjumped over the moon

This is technically correct (we’ve stripped away the <div> tags) but having no whitespace between “fox” and “jumped” means we’ve transformed the content such that we’ve lost semantic and presentational details.

The Solution

There are 2 ways I can see to fix this behavior:

Pre-process the HTML content to ensure or introduce whitespace between block-level elements
Don’t use strip_tags() and utilize a method that better understands the need for spacing between elements

I’ll focus on the latter because that’s the avenue I went down and I didn’t consider pre-processing at the time.
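For reference, the pre-processing option could look roughly like the sketch below. It isn’t the approach taken in this post: the function name stripTagsWithBlockSpacing() and the deliberately short list of tags are assumptions for illustration, and the separator is assumed to be non-empty.

```php
<?php

// Pre-processing sketch: put a separator in front of selected tags, then
// hand the result to strip_tags() to remove the markup itself.
function stripTagsWithBlockSpacing(string $html, string $separator = "\n"): string
{
    // illustrative, incomplete list of tags that should start a new line
    $tags = ['address', 'blockquote', 'br', 'div', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6',
             'li', 'ol', 'p', 'pre', 'table', 'ul'];

    // matches opening and closing forms, with or without attributes
    $pattern = '#</?(?:' . implode('|', $tags) . ')\b[^>]*>#i';

    $spaced = preg_replace_callback($pattern, function (array $match) use ($separator) {
        return $separator . $match[0];
    }, $html);

    $text = strip_tags($spaced);

    // collapse runs of the separator into a single one, then drop any at the edges
    $quoted = preg_quote($separator, '#');
    $text = preg_replace('#(?:' . $quoted . ')+#', $separator, $text);

    return trim($text, $separator);
}

echo stripTagsWithBlockSpacing('<div>the quick brown fox</div><div>jumped over the moon</div>');
```

Inline tags such as <span> aren’t in the list, so their content passes through strip_tags() untouched.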

Pulling together a quick-and-dirty parser, I wrote the following. It’s worth noting that this still doesn’t really consider what the tags are (e.g. whether they’re inline or block) but allows the caller to specify a string ($tagContentSeparator), typically some whitespace, that is inserted between the stripped away tags:

<?php

class HTMLToPlainText
{
    const STATE_READING_CONTENT = 1;
    const STATE_READING_TAG_NAME = 2;

    static public function convert(string $input, string $tagContentSeparator = " "): string
    {
        // the input string as UTF-32
        $fixedWidthString = iconv('UTF-8', 'UTF-32', $input);

        // string within tags that we've found
        $foundContentStrings = [];

        // buffer for current content being read
        $currentContentString = "";

        // flag to indicate how we should interpret what we're reading from $fixedWidthString
        // .. this is initially set to STATE_READING_CONTENT, as we assume we're reading content from the start, even
        // if we haven't encountered a tag (e.g. string that doesn't contain tags)
        $parserState = self::STATE_READING_CONTENT;

        // method to add a non-empty string to $foundContentStrings and reset $currentContentString
        $commitCurrentContentString = function() use (&$currentContentString, &$foundContentStrings) {
            if(strlen($currentContentString) > 0) {
                $foundContentStrings[] = trim($currentContentString);
                $currentContentString = "";
            }
        };

        // iterate through characters in $fixedWidthString
        // checking for tokens indicating if we're within a tag or within content
        for($i=0; $i<strlen($fixedWidthString); $i+=4) {
            // convert back to UTF-8 to simplify character/token checking
            $ch = iconv('UTF-32', 'UTF-8', substr($fixedWidthString, $i, 4));

            if($ch === '<') {
                $parserState = self::STATE_READING_TAG_NAME;
                $commitCurrentContentString();
                continue;
            }

            if($ch === '>') {
                $parserState = self::STATE_READING_CONTENT;
                continue;
            }

            if($parserState === self::STATE_READING_CONTENT) {
                $currentContentString .= $ch;
                continue;
            }
        }

        $commitCurrentContentString();

        return implode($tagContentSeparator, $foundContentStrings);
    }
}

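Note that the conversion to and from UTF-32 isn’t really necessary. I initially did the conversion as I was worried about splitting a multibyte character, but this isn’t possible given how the function reads the input string: the only characters it acts on are “<” and “>”, which are single-byte ASCII, and no byte of a multibyte UTF-8 sequence can be mistaken for either.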
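To illustrate the point, here is a minimal byte-by-byte sketch with no UTF-32 round trip. The function name stripTagsBytewise() is hypothetical and it drops the separator handling entirely; it only shows that multibyte characters survive when the loop reads one byte at a time.

```php
<?php

// Walks the UTF-8 input one byte at a time; only "<" and ">" change state,
// every other byte outside a tag is copied through untouched.
function stripTagsBytewise(string $input): string
{
    $output = "";
    $inTag = false;

    for ($i = 0, $length = strlen($input); $i < $length; $i++) {
        $byte = $input[$i];

        if ($byte === '<') {
            $inTag = true;
            continue;
        }

        if ($byte === '>') {
            $inTag = false;
            continue;
        }

        if (!$inTag) {
            $output .= $byte;
        }
    }

    return $output;
}

$result = stripTagsBytewise("<p>日本語 <b>café</b></p>");

// an empty pattern with the /u modifier only matches if the subject is valid UTF-8
$isValidUtf8 = preg_match('//u', $result) === 1;
```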

Now if we take the following HTML snippet:

<div>the quick brown fox</div><div>jumped over the moon</div>

… rendered in a browser, we get:

… with strip_tags() we get:

the quick brown foxjumped over the moon

… and with HTMLToPlainText::convert() (passing in “\n” for $tagContentSeparator), we get:

the quick brown fox
jumped over the moon

The latter results in text that is semantically correct, as words in different blocks aren’t incorrectly joined. Presentationally we also get a more correct conversion, but the method isn’t really doing anything fancy here; this is due to the caller knowing a bit about the HTML snippet and how a browser would render it, and passing in “\n” for $tagContentSeparator.

Limitations / future work

The improvement here is that textual content is pretty well preserved when doing a conversion, i.e. we don’t have to worry about textual elements being incorrectly concatenated. However, what I wrote is still lacking in 2 key areas:

Generally, in terms of presentation, an arbitrary bit of HTML won’t map to what a user sees in a browser. To a certain degree this is an intractable problem, as presentation is based on browser defaults, CSS styles, etc. Also, there are things that simply don’t have a standard representation in plain text (e.g. bold text, list items, etc.). However, there are cases where sensible defaults might make sense, e.g. stripping away <span> tags but putting a newline between <p> tags.
Whitespace is trimmed from content within tags. This may or may not matter depending on application. In my case, I cared about the words and additional whitespace just added bloat even if it was more accurate to what was in the HTML.

EDIT: See part 2 on addressing these limitations and making the code more robust.

Tags: HTML, PHP, strip_tags

PHP count() is O(1)

Oct 10 2013 · PHP

I was curious about the performance of PHP’s count() function a while back and whether it was worth it to store the result in a variable for repeated use. I discovered the following from this answer by FractalizeR on Stack Overflow:

PHP_FUNCTION(count) calls php_count_recursive(), which in turn calls zend_hash_num_elements() for non-recursive array, which is implemented this way:
ZEND_API int zend_hash_num_elements(const HashTable *ht)
{
    IS_CONSISTENT(ht);
    return ht->nNumOfElements;
}

So you can see, it’s O(1) for $mode = COUNT_NORMAL.
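A rough way to see this from PHP itself is to time repeated count() calls on a small and a large array. This is only a sketch: the function name, array sizes and iteration count are arbitrary choices, and the absolute numbers will vary by machine; what matters is that the two timings should be in the same ballpark.

```php
<?php

// Times $iterations calls to count() on the given array, in milliseconds.
function millisecondsForCountCalls(array $data, int $iterations): float
{
    $start = hrtime(true);
    $total = 0;

    for ($i = 0; $i < $iterations; $i++) {
        // accumulate the result so each call's return value is actually used
        $total += count($data);
    }

    return (hrtime(true) - $start) / 1e6;
}

$small = range(1, 10);
$large = range(1, 500000);

printf("10 elements:     %.3f ms\n", millisecondsForCountCalls($small, 100000));
printf("500000 elements: %.3f ms\n", millisecondsForCountCalls($large, 100000));
```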

Tags: array, count, performance, PHP, time complexity

IMAP Pickup

May 1 2011 · PHP

This was an interesting little project: I wanted to be able to pull attachments from emails in an IMAP mailbox and then download them. I wanted an IMAP solution instead of writing a script for the MTA, as a script would be specific to the MTA software and not transferable to another server. In addition, there’s also the common case where you may simply not have access to the MTA.

The biggest help in putting this together and dealing with attachments was this blog post and this comment on the PHP docs. Information on doing this is a bit scattered and incomplete in many cases, likely because extracting attachments is somewhat difficult as email is a notoriously bad way to transfer files; the file data is base64 encoded and dumped in as part of the message body.

ImapPickup is the class that encapsulates all the necessary functionality:

class ImapPickup
{
    protected $imapStream = null;

    protected function findAttachments($part)
    {
        $partNum = -1;
        $attachments = array();

        $this->findAttachmentsRec($part, $attachments, $partNum, -1);

        return $attachments;
    }

    // $attachments and $partNum are declared by-reference in the signature, so callers pass plain
    // variables (call-time pass-by-reference, e.g. &$attachments, is a fatal error as of PHP 5.4)
    protected function findAttachmentsRec($part, &$attachments, &$partNum, $partNumSub)
    {
        if (isset($part->parts))
        {
            foreach ($part->parts as $partOfPart)
            {
                $this->findAttachmentsRec($partOfPart, $attachments, $partNum, $partNumSub+1);
            }
        }
        else
        {
            if (isset($part->disposition))
            {
                if ($part->disposition == 'attachment')
                {
                    $attachments[] = array($part->dparameters[0]->value, $partNum, $partNumSub);
                }
            }
        }

        $partNum++;
    }

    public function getAttachmentContent($msgNum, $partNum)
    {
        $contents = imap_fetchbody($this->imapStream, $msgNum, $partNum, FT_UID);
        return imap_base64($contents);
    }

    public function getAttachments($msgNum)
    {
        $struct = imap_fetchstructure($this->imapStream,$msgNum,FT_UID);
        $attachments = $this->findAttachments($struct);

        return $attachments;
    }

    public function getAttachmentsFromMessages($msgArray)
    {
        $msgIdToAttachmentsMap = array();

        if ($msgArray)
        {
            foreach($msgArray as $msgId)
            {
                $attachments = $this->getAttachments($msgId);
                if(!empty($attachments))
                {
                    $msgIdToAttachmentsMap[$msgId] = $attachments;
                }
            }
        }

        return $msgIdToAttachmentsMap;
    }

    public function getMessages($searchQuery)
    {
        return imap_search($this->imapStream, $searchQuery, SE_UID);
    }

    public function connect($mailbox, $user, $password)
    {
        $this->imapStream = imap_open($mailbox, $user, $password);
    }

    public function disconnect()
    {
        imap_close($this->imapStream);
    }

}

Here’s a little example of how it can be used. This will query all messages with “pickup::” in the subject line and print out the messageID of all messages with attachments, followed by the filenames of all attachments for that message.

$imapPickup = new ImapPickup();
$imapPickup->connect("{mail.hotspotdot.net:143}INBOX", "test@test.net", "pass123");

$messages = $imapPickup->getMessages("SUBJECT pickup::");
$attachments = $imapPickup->getAttachmentsFromMessages($messages);

foreach($attachments as $msgId => $attArr)
{
    echo "<p>{$msgId} => ";
    
    foreach($attArr as $attachment)
    {
        echo $attachment[0];
        echo ",";
    }

    echo "</p>";
}

$imapPickup->disconnect();

The array for a single file attachment contains 3 entries:

[0] => filename
[1] => major part number
[2] => minor part number

getAttachments(), findAttachments(), and findAttachmentsRec() will return an array of such entries (or an empty array if there are no attachments). getAttachmentsFromMessages() will return a map from messageID => array of single attachments.

The part number (both major and minor) is needed to retrieve the contents of an attachment. For getAttachmentContent(), simply use the major number if the minor number is <= 0, or concatenate them with a period separating them (e.g. "2.3").
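The rule above for building the part number can be written as a small helper. The function name buildSectionNumber() is made up for this sketch and isn’t part of the ImapPickup class:

```php
<?php

// Builds the part number string expected by getAttachmentContent():
// just the major number when the minor number is <= 0, otherwise "major.minor".
function buildSectionNumber(int $major, int $minor): string
{
    if ($minor <= 0) {
        return (string) $major;
    }

    return $major . "." . $minor;
}

// Intended usage with an entry from getAttachments() (not run here, as it needs a live mailbox):
// $content = $imapPickup->getAttachmentContent($msgId, buildSectionNumber($attachment[1], $attachment[2]));
```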

Tags: attachment, email, imap, ImapPickup, PHP

PHP session_start() “Node no longer exists”

Jun 15 2010 · PHP

I stumbled upon this error earlier today as I attempted to store the value of a SimpleXMLElement as a session variable. I was able to narrow down the issue thanks to this post on bytes.com.

According to a user post on the PHP site, this occurs because SimpleXML returns a reference to an object containing the node value, and you can’t store a reference as a session variable. The value must be dereferenced and copied which can be done by casting.

// Bad!
$storageBoxSize = $xml->data->storage_box_size;

// Good!
$storageBoxSize = (int)$xml->data->storage_box_size;
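A slightly fuller sketch of the same idea, using a made-up XML document shaped like the snippet above. Since sessions serialize their contents, a serialize() round trip is used here as a stand-in for storing the value in $_SESSION:

```php
<?php

// hypothetical XML, only here so the example runs on its own
$xml = simplexml_load_string('<root><data><storage_box_size>42</storage_box_size></data></root>');

// without a cast this is still a SimpleXMLElement object, not a plain value
$node = $xml->data->storage_box_size;

// casting copies the node's text out into an ordinary scalar
$storageBoxSize = (int) $node;
$storageBoxLabel = (string) $node;

// a plain scalar survives serialization, which is what session storage relies on
$restored = unserialize(serialize($storageBoxSize));

var_dump($node instanceof SimpleXMLElement);
var_dump($restored);
```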

I find myself hating loosely-typed languages more and more.

Tags: error, PHP, session_start, simplexml, SimpleXMLElement

PHP Array of MIME Types

Mar 22 2010 · PHP

Useful for when you need to set HTTP headers when serving file downloads. At the time of writing, the mime_content_type function was marked as deprecated in the PHP docs and in my case was just returning an empty string. The recommended alternative, finfo_file from the Fileinfo extension (a PECL extension before it was bundled with PHP 5.3), should work fine, but adding and compiling in an extension just for this seems like overkill, especially as you have more control using an array. I’m also never crazy about adding dependencies unless they’re absolutely necessary.

This is from snipplr, but includes the image/png type which was, curiously, missing.

$mime_types = array(
    "323" => "text/h323",
    "acx" => "application/internet-property-stream",
    "ai" => "application/postscript",
    "aif" => "audio/x-aiff",
    "aifc" => "audio/x-aiff",
    "aiff" => "audio/x-aiff",
    "asf" => "video/x-ms-asf",
    "asr" => "video/x-ms-asf",
    "asx" => "video/x-ms-asf",
    "au" => "audio/basic",
    "avi" => "video/x-msvideo",
    "axs" => "application/olescript",
    "bas" => "text/plain",
    "bcpio" => "application/x-bcpio",
    "bin" => "application/octet-stream",
    "bmp" => "image/bmp",
    "c" => "text/plain",
    "cat" => "application/vnd.ms-pkiseccat",
    "cdf" => "application/x-cdf",
    "cer" => "application/x-x509-ca-cert",
    "class" => "application/octet-stream",
    "clp" => "application/x-msclip",
    "cmx" => "image/x-cmx",
    "cod" => "image/cis-cod",
    "cpio" => "application/x-cpio",
    "crd" => "application/x-mscardfile",
    "crl" => "application/pkix-crl",
    "crt" => "application/x-x509-ca-cert",
    "csh" => "application/x-csh",
    "css" => "text/css",
    "dcr" => "application/x-director",
    "der" => "application/x-x509-ca-cert",
    "dir" => "application/x-director",
    "dll" => "application/x-msdownload",
    "dms" => "application/octet-stream",
    "doc" => "application/msword",
    "dot" => "application/msword",
    "dvi" => "application/x-dvi",
    "dxr" => "application/x-director",
    "eps" => "application/postscript",
    "etx" => "text/x-setext",
    "evy" => "application/envoy",
    "exe" => "application/octet-stream",
    "fif" => "application/fractals",
    "flr" => "x-world/x-vrml",
    "gif" => "image/gif",
    "gtar" => "application/x-gtar",
    "gz" => "application/x-gzip",
    "h" => "text/plain",
    "hdf" => "application/x-hdf",
    "hlp" => "application/winhlp",
    "hqx" => "application/mac-binhex40",
    "hta" => "application/hta",
    "htc" => "text/x-component",
    "htm" => "text/html",
    "html" => "text/html",
    "htt" => "text/webviewhtml",
    "ico" => "image/x-icon",
    "ief" => "image/ief",
    "iii" => "application/x-iphone",
    "ins" => "application/x-internet-signup",
    "isp" => "application/x-internet-signup",
    "jfif" => "image/pipeg",
    "jpe" => "image/jpeg",
    "jpeg" => "image/jpeg",
    "jpg" => "image/jpeg",
    "js" => "application/x-javascript",
    "latex" => "application/x-latex",
    "lha" => "application/octet-stream",
    "lsf" => "video/x-la-asf",
    "lsx" => "video/x-la-asf",
    "lzh" => "application/octet-stream",
    "m13" => "application/x-msmediaview",
    "m14" => "application/x-msmediaview",
    "m3u" => "audio/x-mpegurl",
    "man" => "application/x-troff-man",
    "mdb" => "application/x-msaccess",
    "me" => "application/x-troff-me",
    "mht" => "message/rfc822",
    "mhtml" => "message/rfc822",
    "mid" => "audio/mid",
    "mny" => "application/x-msmoney",
    "mov" => "video/quicktime",
    "movie" => "video/x-sgi-movie",
    "mp2" => "video/mpeg",
    "mp3" => "audio/mpeg",
    "mpa" => "video/mpeg",
    "mpe" => "video/mpeg",
    "mpeg" => "video/mpeg",
    "mpg" => "video/mpeg",
    "mpp" => "application/vnd.ms-project",
    "mpv2" => "video/mpeg",
    "ms" => "application/x-troff-ms",
    "mvb" => "application/x-msmediaview",
    "nws" => "message/rfc822",
    "oda" => "application/oda",
    "p10" => "application/pkcs10",
    "p12" => "application/x-pkcs12",
    "p7b" => "application/x-pkcs7-certificates",
    "p7c" => "application/x-pkcs7-mime",
    "p7m" => "application/x-pkcs7-mime",
    "p7r" => "application/x-pkcs7-certreqresp",
    "p7s" => "application/x-pkcs7-signature",
    "pbm" => "image/x-portable-bitmap",
    "pdf" => "application/pdf",
    "pfx" => "application/x-pkcs12",
    "pgm" => "image/x-portable-graymap",
    "pko" => "application/vnd.ms-pkipko",
    "pma" => "application/x-perfmon",
    "pmc" => "application/x-perfmon",
    "pml" => "application/x-perfmon",
    "pmr" => "application/x-perfmon",
    "pmw" => "application/x-perfmon",
    "png" => "image/png",
    "pnm" => "image/x-portable-anymap",
    "pot" => "application/vnd.ms-powerpoint",
    "ppm" => "image/x-portable-pixmap",
    "pps" => "application/vnd.ms-powerpoint",
    "ppt" => "application/vnd.ms-powerpoint",
    "prf" => "application/pics-rules",
    "ps" => "application/postscript",
    "pub" => "application/x-mspublisher",
    "qt" => "video/quicktime",
    "ra" => "audio/x-pn-realaudio",
    "ram" => "audio/x-pn-realaudio",
    "ras" => "image/x-cmu-raster",
    "rgb" => "image/x-rgb",
    "rmi" => "audio/mid",
    "roff" => "application/x-troff",
    "rtf" => "application/rtf",
    "rtx" => "text/richtext",
    "scd" => "application/x-msschedule",
    "sct" => "text/scriptlet",
    "setpay" => "application/set-payment-initiation",
    "setreg" => "application/set-registration-initiation",
    "sh" => "application/x-sh",
    "shar" => "application/x-shar",
    "sit" => "application/x-stuffit",
    "snd" => "audio/basic",
    "spc" => "application/x-pkcs7-certificates",
    "spl" => "application/futuresplash",
    "src" => "application/x-wais-source",
    "sst" => "application/vnd.ms-pkicertstore",
    "stl" => "application/vnd.ms-pkistl",
    "stm" => "text/html",
    "svg" => "image/svg+xml",
    "sv4cpio" => "application/x-sv4cpio",
    "sv4crc" => "application/x-sv4crc",
    "t" => "application/x-troff",
    "tar" => "application/x-tar",
    "tcl" => "application/x-tcl",
    "tex" => "application/x-tex",
    "texi" => "application/x-texinfo",
    "texinfo" => "application/x-texinfo",
    "tgz" => "application/x-compressed",
    "tif" => "image/tiff",
    "tiff" => "image/tiff",
    "tr" => "application/x-troff",
    "trm" => "application/x-msterminal",
    "tsv" => "text/tab-separated-values",
    "txt" => "text/plain",
    "uls" => "text/iuls",
    "ustar" => "application/x-ustar",
    "vcf" => "text/x-vcard",
    "vrml" => "x-world/x-vrml",
    "wav" => "audio/x-wav",
    "wcm" => "application/vnd.ms-works",
    "wdb" => "application/vnd.ms-works",
    "wks" => "application/vnd.ms-works",
    "wmf" => "application/x-msmetafile",
    "wps" => "application/vnd.ms-works",
    "wri" => "application/x-mswrite",
    "wrl" => "x-world/x-vrml",
    "wrz" => "x-world/x-vrml",
    "xaf" => "x-world/x-vrml",
    "xbm" => "image/x-xbitmap",
    "xla" => "application/vnd.ms-excel",
    "xlc" => "application/vnd.ms-excel",
    "xlm" => "application/vnd.ms-excel",
    "xls" => "application/vnd.ms-excel",
    "xlt" => "application/vnd.ms-excel",
    "xlw" => "application/vnd.ms-excel",
    "xof" => "x-world/x-vrml",
    "xpm" => "image/x-xpixmap",
    "xwd" => "image/x-xwindowdump",
    "z" => "application/x-compress",
    "zip" => "application/zip" );
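A minimal sketch of how the array might be used when serving a download. The helper name mimeTypeForFilename(), the fallback type and the file name are assumptions for illustration, and a three-entry array stands in for the full list so the snippet is self-contained:

```php
<?php

// Looks up the MIME type for a file name by its (case-insensitive) extension,
// falling back to a generic binary type when the extension is unknown or missing.
function mimeTypeForFilename(string $filename, array $mimeTypes, string $fallback = "application/octet-stream"): string
{
    $extension = strtolower(pathinfo($filename, PATHINFO_EXTENSION));

    return $mimeTypes[$extension] ?? $fallback;
}

// shortened stand-in for the full $mime_types array above
$mime_types = array(
    "pdf" => "application/pdf",
    "png" => "image/png",
    "zip" => "application/zip",
);

$contentType = mimeTypeForFilename("Report.PDF", $mime_types);

// When actually serving the file, the headers would go out before the content:
// header("Content-Type: " . $contentType);
// header('Content-Disposition: attachment; filename="Report.PDF"');
// readfile($pathToFile); // $pathToFile is hypothetical
```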

Tags: MIME, mime type, mime_content_type, multipurpose internet mail extensions, PHP

Altering the file size limit for PHP uploads

May 12 2008 · PHP

Well, of course, you can edit php.ini to change the file size limit, but for cheap (typically shared) hosting solutions the issue that pops up is that you don’t have access to it.

I was researching how I could do this without modifying php.ini and discovered a forum post that explains a few methods for doing so.

Method A (the one that worked for me):
Edit your .htaccess file (if you don’t have one in your web site’s public root directory, you should be able to just create one) and add the following:
php_value post_max_size 8M
php_value upload_max_filesize 8M

(Of course, replace 8M with whatever limit, in megabytes, you wish to place on upload sizes.)

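Method B (the one that didn’t work for me):
Use the ini_set function:
ini_set('post_max_size', '8M');
ini_set('upload_max_filesize', '8M');

This most likely failed because both settings are PHP_INI_PERDIR, meaning they can be set in php.ini, .htaccess, httpd.conf or .user.ini but not at runtime via ini_set().

Method C (the one I didn’t try):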
Use a custom php.ini. Not sure exactly what would be involved here, but you’d likely have to programmatically make a copy of the existing php.ini file, modify the copy, and then place the copy in the directory of the script that needs it or in your web site’s public root directory.
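Going back to Method B, a quick way to confirm whether a runtime change took effect is to check the return value of ini_set() and re-read the setting. This is just a diagnostic sketch; the 8M value is the same example value used above:

```php
<?php

// record the current limit, then attempt the runtime change from Method B
$before = ini_get('upload_max_filesize');
$result = ini_set('upload_max_filesize', '8M');
$after = ini_get('upload_max_filesize');

// ini_set() returns the old value on success and false on failure
if ($result === false) {
    echo "ini_set() was rejected; upload_max_filesize is still {$after}\n";
} else {
    echo "upload_max_filesize changed from {$result} to {$after}\n";
}
```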