Upgrade morph database to full UTF-8 #1453

@ianheggie-oaf

Description

HoneyBadger alert:
Excon::Error::Socket: Mysql2::Error: Incorrect string value: '\xF0\x9F\x94\x8D G...' for column 'text' at row 1 (ActiveRecord::StatementInvalid)
Backtrace:

line 63 of [PROJECT_ROOT]/app/lib/morph/runner.rb: log
line 45 of [PROJECT_ROOT]/app/lib/morph/runner.rb: block in synch_and_go_with_logging!
line 194 of [PROJECT_ROOT]/app/lib/morph/docker_runner.rb: block in attach_to_run

The error is raised by the code that logs output from a scraper. The output contains a 4-byte UTF-8 character, confirmed with a hex dump:

morph (main)$ hd ,utf
00000000  f0 9f 94 8d 20 47 2e 2e  2e 0a                    |.... G....|
0000000a
morph (main)$ cat ,utf
πŸ” G...

Describe the solution you'd like

Update the database to full Unicode (utf8mb4).

  1. Update the default character set and collation in the app
  2. Create a new database with the new config
  3. Update the app to use the new database name
  4. Deploy (site goes down)
  5. Migrate the data from the old database to the new one
  6. Remove the maintenance page file
  7. Check the site
  8. Remove the old database
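
As a rough sketch of step 2, the statement for the new database could be built like this. The helper name `create_utf8mb4_database_sql` and the database name `morph_utf8mb4` are made up for illustration, and `utf8mb4_unicode_ci` is only assumed here because it mirrors the existing `utf8mb3_unicode_ci` collation; a different utf8mb4 collation may be preferred.

```ruby
# Builds the SQL for a new database whose defaults are utf8mb4.
# Hypothetical helper: the name and the default collation are assumptions.
def create_utf8mb4_database_sql(name, collation: "utf8mb4_unicode_ci")
  # Accept only simple identifiers so the name is safe to backtick-quote.
  unless name.match?(/\A[A-Za-z0-9_]+\z/)
    raise ArgumentError, "unexpected database name: #{name.inspect}"
  end

  "CREATE DATABASE `#{name}` CHARACTER SET utf8mb4 COLLATE #{collation}"
end

# "morph_utf8mb4" is a placeholder, not the real database name.
puts create_utf8mb4_database_sql("morph_utf8mb4")
```

Tables created in that database without an explicit character set would then inherit utf8mb4 when the data is loaded in step 5.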

Describe alternatives you've considered

Remove the 4-byte emoji from scraper_utils, assuming it's the only place the issue occurs.

Update the scrapers that use it.

This is probably the right time to move the repo across for scraper_utils as well.
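
A variation on this alternative would be to filter 4-byte characters out of log text before it is saved, so that other scrapers emitting emoji don't trigger the same error. The sketch below assumes the text is valid UTF-8; `strip_four_byte_chars` is a hypothetical helper, not existing morph or scraper_utils code.

```ruby
# Replaces every character above U+FFFF (4 bytes in UTF-8) with
# +replacement+, leaving 1- to 3-byte characters untouched.
# Hypothetical helper; raises ArgumentError if +text+ is not valid UTF-8.
def strip_four_byte_chars(text, replacement: "?")
  text.gsub(/[\u{10000}-\u{10FFFF}]/, replacement)
end

# The URL here is a placeholder.
puts strip_four_byte_chars("\u{1F50D} GET https://example.com")
```

This loses information in the logs, which is one reason the full utf8mb4 upgrade is the preferred solution.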

Additional context

This is typical for databases that were created on older MySQL versions: the original utf8 character set (now called utf8mb3) stores at most 3 bytes per character, while utf8mb4 was only introduced later, as an option in MySQL 5.5.

The emoji possibly comes from a debug message in scraper_utils:

scraper_utils/lib/scraper_utils/debug_utils.rb
64:      LogUtils.log "🔍 #{http_method.upcase} #{url}"

I confirmed that the MySQL database doesn't currently support 4-byte UTF-8 (the table is defined as utf8mb3):

create_table "log_lines", id: :integer, options: "ENGINE=InnoDB DEFAULT CHARSET=utf8mb3 COLLATE=utf8mb3_unicode_ci", force: :cascade do |t|

Related doc: https://dev.mysql.com/doc/refman/8.4/en/charset-unicode-conversion.html
