PHP preg_split on spaces, but not inside tags

I am using preg_split("/\"[^\"]*\"(*SKIP)(*F)|\x20/", $input_line); and when I run it on phpliveregex.com it produces this array:

 array(10 0=>test 1=>or 2=>oh 3=>yeah 4=>and 5=> 6=>oh 7=>yeah 8=> 9=>"ye we 'hold' it" ) 

That is NOT what I want. The input should be split on spaces only outside HTML tags, for example:

 array(5 0=>test 1=>or 2=>oh yeah 3=>and 4=>oh yeah 5=>"ye we 'hold' it" ) 

In this regex I was only able to add an exception for "double quotes", but I really need help adding more exceptions, for example for tags.

Any explanation of how this regex works would also be appreciated.

It is easier to use DOMDocument, since you don't need to describe what an HTML tag is or what it looks like. You only need to check the nodeType. When the node is a text node, split its content with preg_match_all (this is handier than building a pattern for preg_split):

    $html = 'spaces in a text node test or oh yeah and oh yeah "ye we \'hold\' it" "unclosed double quotes at the end';

    $dom = new DOMDocument;
    $dom->loadHTML("\n" . $html . "\n", LIBXML_HTML_NOIMPLIED);

    $nodeList = $dom->documentElement->childNodes;

    $results = [];

    foreach ($nodeList as $childNode) {
        // Text nodes are split into words and quoted runs;
        // every other node is stored as its HTML source.
        if ($childNode->nodeType == XML_TEXT_NODE
            && preg_match_all('~[^\s"]+|"[^"]*"?~', $childNode->nodeValue, $m)) {
            $results = array_merge($results, $m[0]);
        } else {
            $results[] = $dom->saveHTML($childNode);
        }
    }

    print_r($results);
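To see the same loop act on real markup, here is a minimal sketch with a made-up input. The `<div>` wrapper, the function name `splitOutsideTags`, and the sample string are assumptions for illustration, not part of the answer. Unlike the loop above, this version drops text nodes that contain only whitespace instead of storing them.

```php
<?php
// Hypothetical helper: split an HTML fragment on whitespace, keeping
// elements and double-quoted runs in one piece.
function splitOutsideTags(string $html): array
{
    libxml_use_internal_errors(true); // silence parser warnings on sloppy HTML

    $dom = new DOMDocument();
    // Assumption: wrap the fragment in a <div> so it has a single root
    // element whose children are the fragment's top-level nodes.
    $dom->loadHTML('<div>' . $html . '</div>', LIBXML_HTML_NOIMPLIED);

    $tokens = [];
    foreach ($dom->documentElement->childNodes as $node) {
        if ($node->nodeType === XML_TEXT_NODE) {
            // Words, or a double-quoted run (closing quote optional).
            if (preg_match_all('~[^\s"]+|"[^"]*"?~', $node->nodeValue, $m)) {
                $tokens = array_merge($tokens, $m[0]);
            }
            // Whitespace-only text nodes produce no match and are skipped.
        } else {
            // Any non-text node is kept whole, serialized back to HTML.
            $tokens[] = $dom->saveHTML($node);
        }
    }
    return $tokens;
}

$sample = 'test or <a href="#">oh yeah</a> and <b>oh yeah</b> "ye we \'hold\' it"';
print_r(splitOutsideTags($sample));
```

Only top-level children of the wrapper are visited, so anything nested inside an element stays part of that element's token.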

Note: I chose a default behavior for the case where a double-quoted part is left unclosed (no closing quote); feel free to change it.

Note 2: Sometimes the LIBXML_ constants are not defined. You can work around this by testing for the constant first and defining it if needed:

 if (!defined('LIBXML_HTML_NOIMPLIED')) define('LIBXML_HTML_NOIMPLIED', 8192); 
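If you would rather stay with preg_split, the (*SKIP)(*F) idea from the question can be extended. In `"[^"]*"(*SKIP)(*F)` the engine first consumes a quoted run, then (*F) forces a failure and (*SKIP) forbids retrying from any position inside the consumed text, so only the last alternative, `\x20`, can ever produce a delimiter. The sketch below adds two more "consume and discard" alternatives for tags. The function name and the tag sub-patterns are assumptions, and they will break on attributes containing `>` or on nested elements of the same name.

```php
<?php
// Hypothetical helper: split on spaces that are outside double quotes
// and outside tags, using the (*SKIP)(*F) technique from the question.
function splitOutsideTagsAndQuotes(string $input): array
{
    $pattern = '~
        "[^"]*"(*SKIP)(*F)                   # quoted run: consume, then discard
      | <(\w+)\b[^>]*>.*?</\1>(*SKIP)(*F)    # paired element: consume, then discard
      | <[^>]*>(*SKIP)(*F)                   # any other single tag: consume, then discard
      | \x20                                 # a space that survived: the real delimiter
    ~xs';

    return preg_split($pattern, $input, -1, PREG_SPLIT_NO_EMPTY);
}

$line = 'test or <a href="#">oh yeah</a> and <img src="a b.png"/> "ye we \'hold\' it"';
print_r(splitOutsideTagsAndQuotes($line));
```

The order of the alternatives matters: the discard branches come first, so at any position a quote or tag is swallowed before the plain space gets a chance to match.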

Description

Instead of using a split command, simply match the sections you want to keep:

<(?:(?:img)(?=[\s>\/])(?:[^>=]|=(?:'[^']*'|"[^"]*"|[^'"\s>]*))*\s?\/?>|(a|span|pre|code|strong|b|em|i)(?=[\s>\\])(?:[^>=]|=(?:'[^']*'|"[^"]*"|[^'"\s>]*))*\s?\/?>.*?<\/\1>)|(?:"[^"]*"|[^"<]*)*
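As a rough sketch of how this pattern could be run from PHP: the expression is kept verbatim in a nowdoc so that no quote or backslash needs extra escaping. The `~` delimiter, the absence of modifiers, the helper name `matchSections`, and the sample input are assumptions. Note that this pattern separates tags from runs of text; it does not split the text on spaces.

```php
<?php
// The pattern from the answer, unchanged, wrapped in ~ delimiters.
$pattern = <<<'REGEX'
~<(?:(?:img)(?=[\s>\/])(?:[^>=]|=(?:'[^']*'|"[^"]*"|[^'"\s>]*))*\s?\/?>|(a|span|pre|code|strong|b|em|i)(?=[\s>\\])(?:[^>=]|=(?:'[^']*'|"[^"]*"|[^'"\s>]*))*\s?\/?>.*?<\/\1>)|(?:"[^"]*"|[^"<]*)*~
REGEX;

// Hypothetical helper: return every non-empty section the pattern matches.
function matchSections(string $input, string $pattern): array
{
    preg_match_all($pattern, $input, $m);

    // The last alternative can match the empty string, so drop empty matches.
    return array_values(array_filter($m[0], function ($s) {
        return $s !== '';
    }));
}

$text = 'test or <a href="#">oh yeah</a> and "ye we \'hold\' it"';
print_r(matchSections($text, $pattern));
```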

Example

Live demo

https://regex101.com/r/bK8iL3/1

Sample text

Note the difficult edge case in the second paragraph.

 test or  this  oh yeah  and oh yeah Here we are "ye we 'hold' it" somegfsfdroides

Sample matches

    MATCH 1   0. [0-11]     `test`
    MATCH 2   0. [11-15]    ` or `
    MATCH 3   0. [15-38]    ` this `
    MATCH 4   0. [38-56]    ` oh yeah `
    MATCH 5   0. [56-61]    ` and `
    MATCH 6   0. [61-75]    `oh yeah`
    MATCH 7   0. [75-111]   ` Here we are "ye we 'hold' it" some`
    MATCH 8   0. [111-117]  ``
    MATCH 9   0. [117-121]  `gfsf`
    MATCH 10  0. [121-213]  `droides`
    MATCH 11  0. [213-224]  `

    `
    MATCH 12  0. [224-237]  ``
    MATCH 13  0. [237-254]  ``
    MATCH 14  0. [254-261]  ``
    MATCH 15  0. [261-270]  ``
    MATCH 16  0. [270-277]  ``

Explanation

    NODE                     EXPLANATION
    ----------------------------------------------------------------------
    <                        '<'
    (?:                      group, but do not capture:
      (?:                    group, but do not capture:
        img                  'img'
      )                      end of grouping
      (?=                    look ahead to see if there is:
        [\s>\/]              any character of: whitespace (\n, \r, \t, \f, and " "), '>', '\/'
      )                      end of look-ahead
      (?:                    group, but do not capture (0 or more times (matching the most amount possible)):
        [^>=]                any character except: '>', '='
       |                     OR
        =                    '='
        (?:                  group, but do not capture:
          '                  '\''
          [^']*              any character except: ''' (0 or more times (matching the most amount possible))
          '                  '\''
         |                   OR
          "                  '"'
          [^"]*              any character except: '"' (0 or more times (matching the most amount possible))
          "                  '"'
         |                   OR
          [^'"\s>]*          any character except: ''', '"', whitespace (\n, \r, \t, \f, and " "), '>' (0 or more times (matching the most amount possible))
        )                    end of grouping
      )*                     end of grouping
      \s?                    whitespace (\n, \r, \t, \f, and " ") (optional (matching the most amount possible))
      \/?                    '/' (optional (matching the most amount possible))
      >                      '>'
     |                       OR
      (                      group and capture to \1:
        a                    'a'
       |                     OR
        span                 'span'
       |                     OR
        pre                  'pre'
       |                     OR
        code                 'code'
       |                     OR
        strong               'strong'
       |                     OR
        b                    'b'
       |                     OR
        em                   'em'
       |                     OR
        i                    'i'
      )                      end of \1
      (?=                    look ahead to see if there is:
        [\s>\\]              any character of: whitespace (\n, \r, \t, \f, and " "), '>', '\\'
      )                      end of look-ahead
      (?:                    group, but do not capture (0 or more times (matching the most amount possible)):
        [^>=]                any character except: '>', '='
       |                     OR
        =                    '='
        (?:                  group, but do not capture:
          '                  '\''
          [^']*              any character except: ''' (0 or more times (matching the most amount possible))
          '                  '\''
         |                   OR
          "                  '"'
          [^"]*              any character except: '"' (0 or more times (matching the most amount possible))
          "                  '"'
         |                   OR
          [^'"\s>]*          any character except: ''', '"', whitespace (\n, \r, \t, \f, and " "), '>' (0 or more times (matching the most amount possible))
        )                    end of grouping
      )*                     end of grouping
      \s?                    whitespace (\n, \r, \t, \f, and " ") (optional (matching the most amount possible))
      \/?                    '/' (optional (matching the most amount possible))
      >                      '>'
      .*?                    any character (0 or more times (matching the least amount possible))
      <                      '<'
      \/                     '/'
      \1                     what was matched by capture \1
      >                      '>'
    )                        end of grouping
    |                        OR
    (?:                      group, but do not capture (0 or more times (matching the most amount possible)):
      "                      '"'
      [^"]*                  any character except: '"' (0 or more times (matching the most amount possible))
      "                      '"'
     |                       OR
      [^"<]*                 any character except: '"', '<' (0 or more times (matching the most amount possible))
    )*                       end of grouping