
I have string data in a pandas DataFrame column. I need to convert it to either a parsable JSON string or a dict so that I can read / extract values from it.

Mock / sample DataFrame:

df = pd.DataFrame({'col1': [6010, 6015, 6020, 6025],
                   'json_col': ["{'Id': '060', 'Date': '20210908', 'value': {'Id': '060', 'Code': '06037'}}",
                                "{'Id': '061', 'Date': '20210908', 'value': {'Id': '060', 'Code': '06038'}}",
                                np.nan,
                                "{'Id': '063', 'Date': '20210908', 'value': {'Id': '060', 'Code': '06040'}}"]})

Here's what I tried:

df['json_col'] = df[df['json_col'].notnull()].map(lambda x: dict(eval(x)))

TypeError: eval() arg 1 must be a string, bytes or code object

Note, there are some missing values in the column, and the function needs to be able to handle them.

The Singularity
kms
  • There is no "json type". There is JSON *data*, which is a *string*; and there is *parsed* JSON, which is generally some nested structure of dicts and lists. What you have is a string that is *almost* in JSON format, but not quite. – Karl Knechtel Sep 13 '21 at 16:05
  • @KarlKnechtel right, I need to convert the string to parsable json string. – kms Sep 13 '21 at 16:07
  • 1
    Also, your example initialization doesn't work - you need to triple-quote multi-line strings, either use strings for your `col1` or remove leading zeroes, and use [actual syntax for a NaN](https://stackoverflow.com/questions/19374254/assigning-a-variable-nan-in-python-without-numpy). – Karl Knechtel Sep 13 '21 at 16:11

2 Answers


When you apply a mask like `df[df['json_col'].notnull()]`, the result includes all columns - even though you used a specific column to build the mask - because you're only telling it which rows to keep (the ones where that column isn't null). The 'col1' column values presumably aren't strings in your actual data, which is why `eval` complained. This way worked for me:

df['json_col'] = df[df['json_col'].notnull()]['json_col'].map(lambda x: dict(eval(x)))

Alternatively, you can pull the test into the lambda:

df['json_col'] = df['json_col'].map(lambda x: dict(eval(x)) if pd.notnull(x) else x)

That said, please use `ast.literal_eval` for this kind of evaluation, if you can't just make sure of having real JSON ahead of time.
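A minimal sketch of the `ast.literal_eval` route, using a small stand-in for the question's data (column names as in the question):

```python
import ast

import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [6010, 6015],
                   'json_col': ["{'Id': '060', 'Date': '20210908', 'value': {'Id': '060', 'Code': '06037'}}",
                                np.nan]})

# literal_eval safely parses Python literal syntax, including single-quoted
# dicts, without the arbitrary-code-execution risk of eval
df['json_col'] = df['json_col'].map(lambda x: ast.literal_eval(x) if pd.notnull(x) else x)

# the parsed dicts can then be queried; NaN rows pass through untouched
codes = df['json_col'].map(lambda d: d['value']['Code'] if isinstance(d, dict) else d)
```

The null guard inside the lambda handles the missing values, so no separate mask is needed.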

Karl Knechtel

You could also use the `json` module:

import json

df['json_col'] = df[~df['json_col'].isna()]["json_col"] \
                  .apply(lambda st_dict: json.loads(st_dict.replace("'", '"')))

Note that the `json` module expects property names in a dictionary to be enclosed in double quotes, hence I've replaced the single quotes with double quotes.

It is always good practice to try to vectorize your method; you can refer to vectorized string operations.

p.s. I couldn't find a vectorized operation for replacing or converting str_dict to dict. If anyone knows, please let me know!
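For what it's worth, the quote swap itself can be done with the vectorized `.str` accessor; only the final `json.loads` still runs per element, and `.map` with a null guard keeps the NaNs intact. A sketch under that assumption:

```python
import json

import numpy as np
import pandas as pd

df = pd.DataFrame({'json_col': ["{'Id': '061', 'value': {'Code': '06038'}}", np.nan]})

# vectorized replacement across the whole column; NaN entries stay NaN
cleaned = df['json_col'].str.replace("'", '"', regex=False)

# the parse itself is per-element; guard against the NaN rows
parsed = cleaned.map(lambda s: json.loads(s) if pd.notnull(s) else s)
```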

haneulkim
  • There may be other things that need to be fixed in the data; but yes, if it's *supposed to be* JSON then we should use JSON tools. – Karl Knechtel Sep 14 '21 at 20:38