Commit Graph

12 Commits

Author SHA1 Message Date
Dmitry S
d5a4605d2a (core) Improve encoding detection for csv imports, and make encoding an editable option.
Summary:
- Using a sample of data was causing poor detection if the sample were
  cut mid-character. Switch to using line-based detection.
- Add a simple option for changing encoding. No convenient UI is offered
  since config UI is auto-generated, but this at least makes it possible to
  recover from bad guesses.
- Upgrades chardet library for good measure.

- Also fixes python3-building step, to more reliably rebuild Python
  dependencies when requirements3.* files change.

Test Plan:
Added a python-side test case, and a browser test that encodings can
be switched, errors are displayed, and wrong encodings fail recoverably.

Reviewers: alexmojaki

Reviewed By: alexmojaki

Differential Revision: https://phab.getgrist.com/D3979
2023-08-24 09:50:52 -04:00
Dmitry S
17569561bf (core) Fix issue that ints would be imported with a trailing ".0" from Google Sheets.
Summary:
Whole numbers, when imported from Excel into a Text column show up
without decimals (e.g. "300"), but when imported from Google Sheets show
up with decimals (e.g. "300.0"). The decimals are hard for end-users to
remove. Fix by treating whole numbers consistently as ints.

Test Plan: Added a fixture reproducing the issue, and a test case.

Reviewers: georgegevoian

Reviewed By: georgegevoian

Differential Revision: https://phab.getgrist.com/D3800
2023-02-26 15:24:15 -05:00
Yohan Boniface
4ff5a2eaa7
Be more accepting with None value in headers candidate (#331)
We already filter out a line will only None values, and sometimes
Excel of LibreOffice mistakes the real number of columns adding
one or more that have no value at all.
2022-10-31 15:57:26 -04:00
Yohan Boniface
7bd895ef42 Add BooleanConverter to map proper boolean cells to a Bool column
Note that only proper boolean will be considered, but not integers
nor truthy or falsy strings.
2022-09-10 07:07:45 +02:00
Yohan Boniface
8bc5c7d595
Fix columns with falsy cells wrongly parsed as dates (#276)
Eg. before this commit, this table would result in Date columns:

| A     | B |
| ----- | -- |
| FALSE | 0 |

For now, even FALSE is parsed as Numeric (not sure why we don't have
a BooleanConverter).
2022-09-09 15:13:34 -04:00
Alex Hall
42afb17e36 (core) Run and test imports only in Python 3, upgrade openpyxl, fix weird date handling
Summary:
Python 2 only needs to be supported for the sake of old documents and formulas. This doesn't apply to the separate sandboxes that parse files for imports. Using Python 3 only allows using newer libraries and library versions. In particular, the latest version of openpyxl doesn't support Python 2. This will also make it easier to make other similar changes in the future, such as replacing messytables with a modern library. See https://grist.slack.com/archives/C0234CPPXPA/p1661261829343999?thread_ts=1661260442.837959&cid=C0234CPPXPA

The latest openpyxl is better at handling a particular edge case with broken dates in Excel, but still doesn't quite do what we want, so we monkeypatch it. Discussion: https://grist.slack.com/archives/C02EGJ1FUCV/p1661440851911869?thread_ts=1661154219.515549&cid=C02EGJ1FUCV

Setting `preferredPythonVersion` to '3' in SafePythonComponent ensures that JS always creates import sandboxes that use Python 3. Within Python, a module used by all imports will raise an error in Python 2. Python unit tests of imports are now only run in Python 3, using the `load_tests` protocol of `unittest`.

Test Plan: Mostly existing tests. Added another strange date to the Excel fixture.

Reviewers: dsagal

Reviewed By: dsagal

Subscribers: dsagal

Differential Revision: https://phab.getgrist.com/D3606
2022-09-02 16:27:34 +02:00
George Gevoian
9b08666f96 (core) Handle importing xls files with invalid dimensions
Summary:
This addresses a rare bug where xls files with invalid dimensions
could not be imported into Grist due to how openpyxl handles
parsing them.

Test Plan: Server test.

Reviewers: alexmojaki

Reviewed By: alexmojaki

Differential Revision: https://phab.getgrist.com/D3485
2022-06-16 08:39:17 -07:00
Alex Hall
6c90de4d62 (core) Switch excel import parsing from messytables+xlrd to openpyxl, and ignore empty rows
Summary:
Use openpyxl instead of messytables (which used xlrd internally) in import_xls.py.

Skip empty rows since excel files can easily contain huge numbers of them.

Drop support for xls files (which openpyxl doesn't support) in favour of the newer xlsx format.

Fix some details relating to python virtualenvs and dependencies, as Jenkins was failing to find new Python dependencies.

Test Plan: Mostly relying on existing tests. Updated various tests which referred to xls files instead of xlsx. Added a Python test for skipping empty rows.

Reviewers: georgegevoian

Reviewed By: georgegevoian

Differential Revision: https://phab.getgrist.com/D3406
2022-05-12 14:43:21 +02:00
Dmitry S
64d9faed5a (core) Fix import parsing from choking up on Python isdigit() surprises
Summary:
Python isdigit() returns true for unicode characters such as "²", which fail
when used as an argument to int().

Instead, be explicit about only considering characters 0-9 to be digits.

Test Plan: Added a test case which produces an error without this change.

Reviewers: alexmojaki

Reviewed By: alexmojaki

Differential Revision: https://phab.getgrist.com/D3027
2021-09-20 16:17:34 -04:00
Dmitry S
26356fe588 (core) Fix bug with "maximum recursion depth exceeded" in imports.
Summary:
Our date-guessing logic analyzes text in full looking for date parts.
This diff skip all that work when text is so long that we don't need to
consider it to be a valid date.

This is a quick fix. There are probably many other cases when we don't
need to try hard to parse arbitrary text as dates.

Test Plan: Added a fixture and test case that would trigger the error without the fix.

Reviewers: paulfitz

Subscribers: paulfitz

Differential Revision: https://phab.getgrist.com/D2992
2021-08-20 17:44:48 -04:00
Alex Hall
4d526da58f (core) Move file import plugins into core/sandbox/grist
Summary:
Move all the plugins python code into the main folder with the core code.

Register file importing functions in the same main.py entrypoint as the data engine.

Remove options relating to different entrypoints and code directories. The only remaining plugin-specific option in NSandbox is the import directory/mount, i.e. where files to be parsed are placed.

Test Plan: this

Reviewers: paulfitz

Reviewed By: paulfitz

Subscribers: dsagal

Differential Revision: https://phab.getgrist.com/D2965
2021-08-09 18:37:14 +02:00
Paul Fitzpatrick
b82eec714a (core) move data engine code to core
Summary:
this moves sandbox/grist to core, and adds a requirements.txt
file for reconstructing the content of sandbox/thirdparty.

Test Plan:
existing tests pass.
Tested core functionality manually.  Tested docker build manually.

Reviewers: dsagal

Reviewed By: dsagal

Differential Revision: https://phab.getgrist.com/D2563
2020-07-29 08:57:25 -04:00