Poor quality of the text unit after poppler-normaliser
Recently I tried to process some PDF documents into the WebLab bundle, using poppler-normaliser for the PDF files. While there is big progress in the look and feel of the system, thanks to the structure conservation in the document viewer, there is also one major drawback: the quality of the text inside the main media unit is poorer than when using the default tika normaliser.
The fact is that tika-normaliser sometimes generates new media units to separate some text blocks, or at least adds new lines or tabs when necessary.
The poppler-normaliser makes the assumption that this is structure information and that it should not be part of the text. As a result, when handling a table for instance, it concatenates all the cells into a single word, but adds structure annotations to tell that these should be printed on screen at different positions. However, the WebLab services that follow the normaliser only take the text content into account (and IMO, should not necessarily depend on the structure annotations). This leads to less accurate language identification, very poor named entity extraction, and even poor search results for text inside cells.
It would at least be good to add some spaces between the words! Or even better, add some newlines inside the text when needed, since the document viewer portlet is not the only component that handles the text.
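
To illustrate the request, here is a minimal sketch of the difference between the two behaviours. This is not the poppler-normaliser code or its API: the Fragment class, its row/col fields and the join_fragments function are hypothetical names made up for the example, assuming the normaliser already knows which cell each piece of text belongs to.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Fragment:
    """Hypothetical piece of extracted text (one table cell or text block)."""
    text: str
    row: int  # hypothetical row index taken from the structure information
    col: int  # hypothetical column index taken from the structure information


def join_fragments(fragments: Iterable[Fragment]) -> str:
    """Join fragments with tabs between cells and newlines between rows."""
    ordered = sorted(fragments, key=lambda f: (f.row, f.col))
    lines: List[List[str]] = []
    current_row = None
    for frag in ordered:
        if frag.row != current_row:
            # Start a new output line whenever the row changes.
            lines.append([])
            current_row = frag.row
        lines[-1].append(frag.text.strip())
    return "\n".join("\t".join(cells) for cells in lines)


cells = [
    Fragment("Paris", 0, 0), Fragment("France", 0, 1),
    Fragment("Berlin", 1, 0), Fragment("Germany", 1, 1),
]

# Behaviour described in this report: every cell glued into one "word".
concatenated = "".join(f.text for f in cells)

# Requested behaviour: separators kept in the text content itself.
separated = join_fragments(cells)
```

With separators in the text, services that only read the text content (language identification, named entity extraction, search) would see "Paris", "France", "Berlin" and "Germany" as distinct tokens, while the structure annotations could still be used by the document viewer.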