PHP中的平衡换行(最小不整齐度)

16
我将在PHP中编写一个自动换行算法。我想把短语分割成最多包含m个字符的n行小文本块(没有给出n,所以需要尽可能多的行)。特别之处在于,尽可能平衡每行的长度(以字符计)跨越行。
输入文本示例:
How to do things

错误的输出(这是正常的换行行为),m=6

How to
do
things

期望输出,始终为m = 6

How 
to do 
things

有人对如何实现此功能有建议或指导方针吗?基本上,我正在寻找一些东西,可以将短语漂亮地打印在两行或三行(尽可能相等的长度)上。


更新: 看起来我正在寻找一个最小不整齐换行算法。但我找不到任何实现在真正的编程语言中(任何一个,然后我可以将其转换为PHP)。


更新2: 我发起了一个悬赏。在任何过程式语言中,是否可能不存在最小不整齐算法的公共实现?我需要一些以可转换为过程指令的方式编写的东西。现在我能找到的只是一堆(通用的)方程式,但是它们需要优化搜索过程。我也将感激只能近似那个最优搜索算法的实现。

我需要关于如何实现这个函数的建议和指导。我不知道从哪里开始,我以前从未做过这样的事情。 - lorenzo-s
我建议你将空格视为一个字符。在计算总字符数时,如果单词的总字符数(m)大于每行限制字符数(n),你可以在第 n+1 行末尾补充空格来弥补。 - nand
2
值得一读:Knuth的“将段落分成行”的文章。 - Justin ᚅᚔᚈᚄᚒᚔ
你假设使用等宽字体了吗? - nickf
@nickf 是的,是的。否则太难了 :) - lorenzo-s
6个回答

10

我按照Alex的思路实现了维基百科算法,但是直接使用PHP编写(这对我来说是一个有趣的练习)。理解如何使用最优成本函数f(j),即“递归”部分,不是很容易。感谢Alex提供了注释详细的代码。

/** 
 * minimumRaggedness
 *
 * @param string $input paragraph. Each word separed by 1 space.
 * @param int $LineWidth the max chars per line.
 * @param string $lineBreak wrapped lines separator.
 * 
 * @return string $output the paragraph wrapped.
 */
function minimumRaggedness($input, $LineWidth, $lineBreak = "\n")
{
    $words = explode(" ", $input);
    $wsnum = count($words);
    $wslen = array_map("strlen", $words);
    $inf = 1000000; //PHP_INT_MAX;

    // keep Costs
    $C = array();

    for ($i = 0; $i < $wsnum; ++$i)
    {
        $C[] = array();
        for ($j = $i; $j < $wsnum; ++$j)
        {
            $l = 0;
            for ($k = $i; $k <= $j; ++$k)
                $l += $wslen[$k];
            $c = $LineWidth - ($j - $i) - $l;
            if ($c < 0)
                $c = $inf;
            else
                $c = $c * $c;
            $C[$i][$j] = $c;
        }
    }

    // apply recurrence
    $F = array();
    $W = array();
    for ($j = 0; $j < $wsnum; ++$j)
    {
        $F[$j] = $C[0][$j];
        $W[$j] = 0;
        if ($F[$j] == $inf)
        {
            for ($k = 0; $k < $j; ++$k)
            {
                $t = $F[$k] + $C[$k + 1][$j];
                if ($t < $F[$j])
                {
                    $F[$j] = $t;
                    $W[$j] = $k + 1;
                }
            }
        }
    }

    // rebuild wrapped paragraph
    $output = "";
    if ($F[$wsnum - 1] < $inf)
    {
        $S = array();
        $j = $wsnum - 1;
        for ( ; ; )
        {
            $S[] = $j;
            $S[] = $W[$j];
            if ($W[$j] == 0)
                break;
            $j = $W[$j] - 1;
        }

        $pS = count($S) - 1;
        do
        {
            $i = $S[$pS--];
            $j = $S[$pS--];
            for ($k = $i; $k < $j; $k++)
                $output .= $words[$k] . " ";
            $output .= $words[$k] . $lineBreak;
        }
        while ($j < $wsnum - 1);
    }
    else
        $output = $input;

    return $output;
}

?>


1
我希望我能分配赏金 :( 谢谢大家,maniek、Alex和chac。你们做得很好,很辛苦。是的,chac,最优成本函数递归部分是最难的,我对此一无所知,这就是我放弃的地方。 - lorenzo-s

4

快速而简单,在c++中实现

#include <sstream>
#include <iostream>
#include <vector>
#include <cstdlib>
#include <memory.h>

using namespace std;

int cac[1000][1000];
string res[1000][1000];
vector<string> words;
int M;

int go(int a, int b){
    if(cac[a][b]>= 0) return cac[a][b];
    if(a == b) return 0;

    int csum = -1;
    for(int i=a; i<b; ++i){
    csum += words[i].size() + 1;
    }
    if(csum <= M || a == b-1){
    string sep = "";
        for(int i=a; i<b; ++i){
            res[a][b].append(sep);
            res[a][b].append(words[i]);
            sep = " ";
    }
    return cac[a][b] = (M-csum)*(M-csum);
    }

    int ret = 1000000000;
    int best_sp = -1;
    for(int sp=a+1; sp<b; ++sp){
    int cur = go(a, sp) + go(sp,b);
    if(cur <= ret){
        ret = cur;
        best_sp = sp;
    }
    }
    res[a][b] = res[a][best_sp] + "\n" + res[best_sp][b];
    return cac[a][b] = ret;
}


int main(int argc, char ** argv){
    memset(cac, -1, sizeof(cac));
    M = atoi(argv[1]);
    string word;
    while(cin >> word) words.push_back(word);
    go(0, words.size());
    cout << res[0][words.size()] << endl;
}

测试:

$ echo "The quick brown fox jumps over a lazy dog" |./a.out 10
The quick
brown fox
jumps over
a lazy dog

编辑:刚刚查看了最小不整齐换行的维基百科页面。将算法更改为给定的算法(带有平方惩罚)


你有考虑过使用UTF-8编码吗? - dynamic
不需要,因为原帖作者打算转到PHP上,所以我没有费心。UTF-8不会引起问题。只要在进行大小计算时,使用字符而不是字节进行计算即可。 - maniek
@maniek 谢谢。我会尽快尝试。如果它运行良好,我将接受您的答案并开始将其转换为PHP。UTF8不是问题。可能我只会有ASCII字符串。否则,我可以自己处理 :) - lorenzo-s
它运行得非常好。做得很棒。谢谢!!!我可以在10小时内给你100分。 - lorenzo-s

2

一个C语言版本:

// This is a direct implementation of the minimum raggedness word wrapping
// algorithm from http://en.wikipedia.org/wiki/Word_wrap#Minimum_raggedness

#include <stdio.h>
#include <string.h>
#include <ctype.h>
#include <stdlib.h>
#include <limits.h>

const char* pText = "How to do things";
int LineWidth = 6;
int WordCnt;
const char** pWords;
int* pWordLengths;

int* pC;
int* pF;
int* pW;
int* pS;

int CountWords(const char* p)
{
  int cnt = 0;

  while (*p != '\0')
  {
    while (*p != '\0' && isspace(*p)) p++;

    if (*p != '\0')
    {
      cnt++;
      while (*p != '\0' && !isspace(*p)) p++;
    }
  }

  return cnt;
}

void FindWords(const char* p, int cnt, const char** pWords, int* pWordLengths)
{
  while (*p != '\0')
  {
    while (*p != '\0' && isspace(*p)) p++;

    if (*p != '\0')
    {
      *pWords++ = p;
      while (*p != '\0' && !isspace(*p)) p++;
      *pWordLengths++ = p - pWords[-1];
    }
  }
}

void PrintWord(const char* p, int l)
{
  int i;

  for (i = 0; i < l; i++)
    printf("%c", p[i]);
}

// 1st program's argument is the text
// 2nd program's argument is the line width
int main(int argc, char* argv[])
{
  int i, j;

  if (argc >= 3)
  {
    pText = argv[1];
    LineWidth = atoi(argv[2]);
  }

  WordCnt = CountWords(pText);

  pWords = malloc(WordCnt * sizeof(*pWords));
  pWordLengths = malloc(WordCnt * sizeof(*pWordLengths));

  FindWords(pText, WordCnt, pWords, pWordLengths);

  printf("Input Text: \"%s\"\n", pText);
  printf("Line Width: %d\n", LineWidth);
  printf("Words     : %d\n", WordCnt);

#if 0
  for (i = 0; i < WordCnt; i++)
  {
    printf("\"");
    PrintWord(pWords[i], pWordLengths[i]);
    printf("\"\n");
  }
#endif

  // Build c(i,j) in pC[]
  pC = malloc(WordCnt * WordCnt * sizeof(int));
  for (i = 0; i < WordCnt; i++)
  {
    for (j = 0; j < WordCnt; j++)
      if (j >= i)
      {
        int k;
        int c = LineWidth - (j - i);
        for (k = i; k <= j; k++) c -= pWordLengths[k];
        c = (c >= 0) ? c * c : INT_MAX;
        pC[j * WordCnt + i] = c;
      }
      else
        pC[j * WordCnt + i] = INT_MAX;
  }

  // Build f(j) in pF[] and store the wrap points in pW[]
  pF = malloc(WordCnt * sizeof(int));
  pW = malloc(WordCnt * sizeof(int));
  for (j = 0; j < WordCnt; j++)
  {
    pW[j] = 0;
    if ((pF[j] = pC[j * WordCnt]) == INT_MAX)
    {
      int k;
      for (k = 0; k < j; k++)
      {
        int s;
        if (pF[k] == INT_MAX || pC[j * WordCnt + k + 1] == INT_MAX)
          s = INT_MAX;
        else
          s = pF[k] + pC[j * WordCnt + k + 1];
        if (pF[j] > s)
        {
          pF[j] = s;
          pW[j] = k + 1;
        }
      }
    }
  }

  // Print the optimal solution cost
  printf("f         : %d\n", pF[WordCnt - 1]);

  // Print the optimal solution, if any
  pS = malloc(2 * WordCnt * sizeof(int));
  if (pF[WordCnt - 1] != INT_MAX)
  {
    // Work out the solution's words by back tracking the
    // wrap points from pW[] and store them on the pS[] stack
    j = WordCnt - 1;
    for (;;)
    {
      *pS++ = j;
      *pS++ = pW[j];
      if (!pW[j]) break;
      j = pW[j] - 1;
    }
    // Print the solution line by line, word by word
    // in direct order
    do
    {
      int k;
      i = *--pS;
      j = *--pS;
      for (k = i; k <= j; k++)
      {
        PrintWord(pWords[k], pWordLengths[k]);
        printf(" ");
      }
      printf("\n");
    } while (j < WordCnt - 1);
  }

  return 0;
}

输出1:

ww.exe
Input Text: "How to do things"
Line Width: 6
Words     : 4
f         : 10
How
to do
things

输出结果2:

ww.exe "aaa bb cc ddddd" 6
Input Text: "aaa bb cc ddddd"
Line Width: 6
Words     : 4
f         : 11
aaa
bb cc
ddddd

输出 3:

ww.exe "I started a bounty for this. Is it possible that do not exist any public implementation of Minimum raggedness algorithm in any procedural language? I need something written in a way that can be translated into procedural instructions. All I can find now is just a bounch of (generic) equation that however need a optimal searhing procedure. I will be grateful also for an implementation that can only approximate that optimal searching algorithm." 60
Input Text: "I started a bounty for this. Is it possible that do not exist any public implementation of Minimum raggedness algorithm in any procedural language? I need something written in a way that can be translated into procedural instructions. All I can find now is just a bounch of (generic) equation that however need a optimal searhing procedure. I will be grateful also for an implementation that can only approximate that optimal searching algorithm."
Line Width: 60
Words     : 73
f         : 241
I started a bounty for this. Is it possible that do not
exist any public implementation of Minimum raggedness
algorithm in any procedural language? I need something
written in a way that can be translated into procedural
instructions. All I can find now is just a bounch of
(generic) equation that however need a optimal searhing
procedure. I will be grateful also for an implementation
that can only approximate that optimal searching algorithm.

干得好,运行得很顺利。抱歉,但我必须给C++版本提供赏金。更加简洁 :) - lorenzo-s
感谢您提供如此优秀、时尚的代码。理解如何使用递归输出对我来说并不容易,而您的解决方案比接受的方案要高效得多,我认为这是不争的事实。唉... - CapelliC

2
我认为最简单的方法是通过限制之间的迭代来看待它。
例如:
/** 
 * balancedWordWrap
 *
 * @param string $input
 * @param int $maxWidth the max chars per line
 */
function balancedWordWrap($input, $maxWidth = null) {
    $length = strlen($input);
    if (!$maxWidth) {
        $maxWidth = min(ceil($length / 2), 75);
    }   
    $minWidth = min(ceil($length / 2), $maxWidth / 2); 

    $permutations = array();
    $scores = array();
    $lowestScore = 999;
    $lowest = $minWidth;

    foreach(range($minWidth, $maxWidth) as $width) {
        $permutations[$width] = wordwrap($input, $width);
        $lines = explode("\n", $permutations[$width]);

        $max = 0;
        foreach($lines as $line) {
            $lineLength = strlen($line);
            if ($lineLength > $max) {
                $max = $lineLength;
            }   
        }   

        $score = 0;
        foreach($lines as $line) {
            $lineLength = strlen($line);
            $score += pow($max - $lineLength, 2); 
        }   

        $scores[$width] = $score;
        if ($score < $lowestScore) {
            $lowestScore = $score;
            $lowest = $width;
        }   
    }   

    return $permutations[$lowest];
} 

给定输入“如何做事”

输出结果为

How
to do
things

给定输入 "Mary had a little lamb"

输出结果

Mary had a
little lamb

给定输入 "This extra-long paragraph was writtin to demonstrate how the fmt(1) program handles longer inputs. When testing inputs, you don\'t want them to be too short, nor too long, because the quality of the program can only be determined upon inspection of complex content. The quick brown fox jumps over the lazy dog. Congress shall make no law respecting an establishment of religion, or prohibiting the free exercise thereof; or abridging the freedom of speech, or of the press; or the right of the people peaceably to assemble, and to petition the Government for a redress of grievances.", 且最大宽度限制为75个字符时,输出结果如下:

This extra-long paragraph was writtin to demonstrate how the `fmt(1)`
program handles longer inputs. When testing inputs, you don't want them 
be too short, nor too long, because the quality of the program can only be
determined upon inspection of complex content. The quick brown fox jumps
over the lazy dog. Congress shall make no law respecting an establishment
of religion, or prohibiting the free exercise thereof; or abridging the
freedom of speech, or of the press; or the right of the people peaceably
to assemble, and to petition the Government for a redress of grievances.

在我看来,这不是一个最小换行算法,除非在PHP中的自动换行已经实现了它! - CapelliC
我并不会假装这是@chac - 但它确实是PHP中一个平衡的自动换行实现。我必须指出,它有缺陷 - 它不会完美无缺,word_wrap没有平衡性,它只是继续直到单词不适合为止,并在添加下一个单词之前添加一个新行。我重新命名了这个函数,只是为了更清晰一点。 - AD7six
嗯... 如果我理解正确的话,它只是尝试不同的最大行长度,然后计算哪个更好地“压缩”文本。谢谢你的努力,+1。希望有人会用它。但是我必须接受给我一个真正的最小杂乱度实现的答案,这就是我正在寻找的。 - lorenzo-s
@AD7six 我可以向你保证这不是机器翻译。只需查看我的个人资料,你就会相信自己的眼睛 :) - lorenzo-s

0
这是一个Bash版本:
#! /bin/sh
if ! [[ "$1" =~ ^[0-9]+$ ]] ; then
    echo "Usage: balance <width> [ <string> ]"
    echo " "
    echo "  if string is not passed as parameter it will be read from STDIN\n"
    exit 2
elif [ $# -le 1 ] ; then
    LINE=`cat`
else
    LINE="$2"
fi

LINES=`echo "$LINE" | fold -s -w $1 | wc -l`
MAX=$1
MIN=0

while [ $MAX -gt $(($MIN+1)) ]
do
    TRY=$(( $MAX + $MIN >> 1 ))
    NUM=`echo "$LINE" | fold -s -w $TRY | wc -l`
    if [ $NUM -le $LINES ] ; then
        MAX=$TRY
    else
        MIN=$TRY
    fi
done

echo "$LINE" | fold -s -w $MAX

例子:

$ balance 50 "Now is the time for all good men to come to the aid of the party."
Now is the time for all good men 
to come to the aid of the party.

需要使用'fold'和'wc'命令,这两个命令通常在bash安装时就已经可用。


0

Justin的链接Knuth将段落分成行是历史上最好的答案。(新系统还应用微型排版技术,如调整字符宽度、字距等,但如果你只是想要等宽纯文本,这些额外的方法并没有帮助。)

如果你只是想解决问题,许多Linux系统上由自由软件基金会提供的fmt(1)实用程序实现了Knuth算法的变体,也试图避免在句子结尾处换行。我编写了您的输入和一个更大的示例,并通过fmt -w 20运行它们以强制每行20个字符:

$ fmt -w 20 input 
Lorem ipsum dolor
sit amet

Supercalifragilisticexpialidocious
and some other
small words

One long
extra-long-word

This extra-long
paragraph
was writtin to
demonstrate how the
`fmt(1)` program
handles longer
inputs. When
testing inputs,
you don't want them
to be too short,
nor too long,
because the quality
of the program can
only be determined
upon inspection
of complex
content. The quick
brown fox jumps
over the lazy
dog. Congress
shall make no
law respecting
an establishment
of religion, or
prohibiting the
free exercise
thereof; or
abridging the
freedom of speech,
or of the press;
or the right of the
people peaceably
to assemble,
and to petition
the Government
for a redress of
grievances.

如果您允许默认的75个字符宽度用于非平凡输入,则输出看起来会更好:

$ fmt input 
Lorem ipsum dolor sit amet

Supercalifragilisticexpialidocious and some other small words

One long extra-long-word

This extra-long paragraph was writtin to demonstrate how the `fmt(1)`
program handles longer inputs. When testing inputs, you don't want them
to be too short, nor too long, because the quality of the program can
only be determined upon inspection of complex content. The quick brown
fox jumps over the lazy dog. Congress shall make no law respecting an
establishment of religion, or prohibiting the free exercise thereof;
or abridging the freedom of speech, or of the press; or the right of
the people peaceably to assemble, and to petition the Government for a
redress of grievances.

fmt 不符合我的需求。它的工作很出色,但我需要尽可能使行长度相等。在您最后的示例中,最后一行比其他行要短得多;肯定存在一种组合,可以使所有行长度相当。顺便说一下,我只需要处理短语,每个短语最多80个字符。我正在寻找一种方法来将短语漂亮地打印在两到三行(相当)相等的长度上。 - lorenzo-s
啊,我的错误是假设您包含的示例仅用于简单的说明 - 它们准确地反映了数据! :) - sarnold

网页内容由stack overflow 提供, 点击上面的
可以查看英文原文,
原文链接