gristlabs_grist-core/sandbox/grist/imports/import_utils.py

"""
Helper functions for import plugins
"""
import itertools
import logging
import os

import six
from six.moves import zip

if six.PY2:
  raise RuntimeError("Imports should use a Python 3 environment")


log = logging.getLogger(__name__)

def get_path(file_source):
  """Returns the path to an imported file inside $IMPORTDIR ('/importdir' if unset or empty)."""
  importdir = os.environ.get('IMPORTDIR') or '/importdir'
  return os.path.join(importdir, file_source)

def capitalize(word):
  """Capitalize the first character in the word (without lowercasing the rest)."""
  # Slice rather than index, so that an empty string is returned unchanged instead of raising.
  return word[:1].capitalize() + word[1:]

def _is_numeric(text):
  """Returns whether text can be parsed as an integer, float, or complex number."""
  for t in six.integer_types + (float, complex):
    try:
      t(text)
      return True
    except (ValueError, OverflowError):
      pass
  return False


def _is_header(header, data_rows):
  """
  Returns whether header can be considered a legitimate header for data_rows.
  """
  # A header should contain only text: reject it if any value is non-text or looks numeric.
  for cell in header:
    if not isinstance(cell.value, six.string_types) or _is_numeric(cell.value):
      return False

  # If it's all text, see if the values in the first row repeat in other rows. That's uncommon for
  # a header.
  for row in data_rows:
    for cell, header_cell in zip(row, header):
      if cell.value and cell.value == header_cell.value:
        return False

  return True

def _count_nonempty(row):
  """
  Returns the count of cells in row, ignoring trailing empty cells.
  """
  count = 0
  for i, c in enumerate(row):
    if not c.empty:
      count = i + 1
  return count


def find_first_non_empty_row(rows):
  """
  Returns (data_offset, header), where header is the first row with non-empty fields and
  data_offset is the index of the row that follows it, or (0, []) if there are no non-empty rows.
  """
  for i, row in enumerate(rows):
    if _count_nonempty(row) > 0:
      return i + 1, row
  # No non-empty rows.
  return 0, []


def expand_headers(headers, data_offset, rows):
  """
  Returns the header values, padded with empty strings so that there are enough columns for all
  rows in the given sample.
  """
  data_rows = itertools.islice(rows, data_offset, None)
  row_length = max(itertools.chain([len(headers)], (_count_nonempty(r) for r in data_rows)))
  header_values = [h.value.strip() for h in headers] + [u''] * (row_length - len(headers))
  return header_values


def headers_guess(rows):
  """
  Our own smarter version of messytables.headers_guess, which also guesses as to whether one of
  the first rows is in fact a header. Returns (data_offset, headers) where data_offset is the
  index of the first line of data, and headers is the list of guessed headers (which will contain
  empty strings if the file had no headers).
  """
  # Messytables guesses at the length of data rows, and then assumes that the first row that has
  # close to that many non-empty fields is the header, where by "close" it means 1 less.
  #
  # For Grist, it's better to mistake headers for data than to mistake data for headers. Note that
  # there is csv.Sniffer().has_header(), which tries to be clever, but it messes up too much.
  #
  # We only consider for the header the first row with non-empty cells. It is a header if
  #   - it has no non-text fields
  #   - none of the fields have a value that repeats in that column of data

  # Find the first row with non-empty fields.
  data_offset, header = find_first_non_empty_row(rows)
  if not header:
    return data_offset, header

  # Let's see if row is really a header. If it isn't, treat it as data: step data_offset back to
  # include it, and use no header names.
  if not _is_header(header, itertools.islice(rows, data_offset, None)):
    data_offset -= 1
    header = []

  # Expand header to have enough columns for all rows in the given sample.
  header_values = expand_headers(header, data_offset, rows)

  return data_offset, header_values
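

# ---------------------------------------------------------------------------------------------
# Illustrative sketch, not part of the original module. It restates the header heuristic used by
# headers_guess() on plain Python values, so that it can be tried without cell objects.
# Assumptions: rows are lists of str / number / None (rather than cells with `.value` and
# `.empty`), and the `_sketch_*` names are hypothetical, introduced only for this example.
# ---------------------------------------------------------------------------------------------

def _sketch_is_numeric(text):
  """Returns whether text can be parsed as an int, float, or complex number."""
  for t in (int, float, complex):
    try:
      t(text)
      return True
    except (ValueError, OverflowError):
      pass
  return False


def _sketch_count_nonempty(row):
  """Returns the count of values in row, ignoring trailing None or '' values."""
  count = 0
  for i, value in enumerate(row):
    if value not in (None, ''):
      count = i + 1
  return count


def _sketch_headers_guess(rows):
  """Returns (data_offset, headers) for a list of plain-value rows."""
  # Only the first row with any non-empty value is considered as a header candidate.
  for i, row in enumerate(rows):
    if _sketch_count_nonempty(row) > 0:
      break
  else:
    return 0, []

  candidate, data = rows[i], rows[i + 1:]
  all_text = all(isinstance(v, str) and not _sketch_is_numeric(v) for v in candidate)
  repeats = any(v and v == h for data_row in data for v, h in zip(data_row, candidate))
  if all_text and not repeats:
    data_offset, names = i + 1, [h.strip() for h in candidate]
  else:
    # Better to mistake a header for data than the reverse: keep the row as data.
    data_offset, names = i, []

  # Pad with empty names so that every row from data_offset on has a column.
  width = max([len(names)] + [_sketch_count_nonempty(r) for r in rows[data_offset:]])
  return data_offset, names + [''] * (width - len(names))


# Example: for [['', ''], ['Name', 'Age'], ['Alice', '30'], ['Bob', '25', 'x']], the sketch treats
# ['Name', 'Age'] as the header; a first row of ['1', '2'] would instead be kept as data.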