data cleaning

#1
Hi! I am reading data from scanned medical documents (provider notes) using Pytesseract OCR. The resulting text has some noise and misspellings. My ultimate goal is to extract useful medical information from the data. Right now I'm stuck on how to correct both medical and English misspellings. I think I need to create a dictionary that contains both medical and English words. I'm looking for direction on what steps I need to perform.
 
#3
Since you're using Python, I imagine you'll need some NLP approach to correct the misspellings, most likely one that checks each OCR token against the combined medical and English dictionary you mention. As for the "noise" in your dataset, it depends on what you mean by that.
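
To make the dictionary idea concrete, here is a minimal sketch using only the standard library's `difflib`. It is not a recommendation of a specific method, just one simple way to start. The word lists `ENGLISH_WORDS` and `MEDICAL_WORDS` are tiny made-up placeholders, and the helper names, the 0.8 similarity cutoff, and the 4-character minimum are assumptions for illustration; in practice you would load a full English word list and a real medical vocabulary and tune those values on your own notes.

```python
import difflib
import re

# Hypothetical placeholder vocabularies; replace with real word lists.
ENGLISH_WORDS = {"patient", "has", "reports", "pain", "history", "chest"}
MEDICAL_WORDS = {"hypertension", "metformin", "diabetes", "mellitus"}

VOCAB_SET = ENGLISH_WORDS | MEDICAL_WORDS
VOCAB_LIST = sorted(VOCAB_SET)  # sorted so results are deterministic


def correct_token(token, cutoff=0.8, min_len=4):
    """Return the closest vocabulary word, or the token unchanged.

    Short tokens and tokens already in the vocabulary are left alone.
    Note: a corrected token comes back in lowercase.
    """
    lower = token.lower()
    if len(lower) < min_len or lower in VOCAB_SET:
        return token
    matches = difflib.get_close_matches(lower, VOCAB_LIST, n=1, cutoff=cutoff)
    return matches[0] if matches else token


def correct_text(text):
    """Correct each alphabetic run in the text, keeping everything else."""
    return re.sub(r"[A-Za-z]+", lambda m: correct_token(m.group(0)), text)


if __name__ == "__main__":
    print(correct_text("Patient has hypertensoin and diabetis"))
    # prints: Patient has hypertension and diabetes
```

Tokens with no sufficiently close match are left as they are, which is the safer default for drug names and abbreviations that are missing from the dictionary. A higher cutoff makes fewer but safer corrections.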