Preprocessing Judgment Text
Judgment texts need to be preprocessed before running the AI models.
Preprocessing Activities:
The following preprocessing activities are performed using a spaCy pretrained model:
Separating the preamble from the judgment text
Splitting the judgment text into sentences
Converting upper-case words in the preamble to title case
Replacing newline characters within a sentence with spaces in the judgment text
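The last two cleanup steps can be sketched in pure Python. This is an illustration only: the library performs its preprocessing through a spaCy pipeline, and the helper names below are hypothetical.

```python
import re

def clean_preamble(preamble: str) -> str:
    """Convert fully upper-case words (e.g. 'SUPREME COURT') to title case."""
    return " ".join(
        w.title() if w.isupper() and w.isalpha() else w
        for w in preamble.split()
    )

def join_broken_lines(sentence: str) -> str:
    """Replace newline characters inside a sentence with single spaces."""
    return re.sub(r"\s*\n\s*", " ", sentence).strip()
```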
The preprocessing is done using the Data object:
texts_to_process = [text1, text2]
data = Data(texts_to_process, preprocessing_nlp_model='en_core_web_trf')
The preprocessing is lazily evaluated: a document is not processed until it is actually needed.
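As a rough illustration of lazy evaluation (this is not the library's actual internals), a Python generator captures the idea: the expensive work for each document only happens when that document is consumed.

```python
def lazy_preprocess(texts):
    """Yield preprocessed documents one at a time, on demand."""
    for text in texts:
        # The expensive spaCy processing would happen here;
        # this stand-in just normalizes whitespace.
        yield text.strip().replace("\n", " ")

docs = lazy_preprocess(["  first judgment\ntext ", "second judgment"])
# Nothing has been processed yet; work happens as the generator is consumed:
first = next(docs)
```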
Trade-Off between Preprocessing Accuracy and Run Time
You can choose which spaCy pretrained model to use for preprocessing by passing the preprocessing_nlp_model parameter when creating the Data object. This choice strongly affects the quality of the downstream AI models. We recommend 'en_core_web_trf' for preprocessing, but it can be slow. The available preprocessing models are 'en_core_web_trf' (slowest but most accurate), 'en_core_web_md' (intermediate), and 'en_core_web_sm' (fastest but least accurate).
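The trade-off can be made explicit with a small lookup. The choose_model helper below is hypothetical; only the model names come from the list above.

```python
# Available preprocessing models, keyed by what you want to prioritize.
PREPROCESSING_MODELS = {
    "accuracy": "en_core_web_trf",  # transformer pipeline: slowest, best accuracy
    "balanced": "en_core_web_md",   # medium pipeline: middle ground
    "speed": "en_core_web_sm",      # small pipeline: fastest, least accurate
}

def choose_model(priority: str = "accuracy") -> str:
    """Return the spaCy model name matching the given priority."""
    return PREPROCESSING_MODELS[priority]

# e.g. data = Data(texts_to_process, preprocessing_nlp_model=choose_model("speed"))
```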
Additional Parameters when creating the Data object
mini_batch_size (int): Batch size used when processing a document. If a document is longer than the given batch size, it is chunked and the chunks are processed separately.
use_gpu (bool): Whether to use a GPU for processing. Setting it to True does not guarantee the GPU will be used; the support libraries mentioned in the documentation must also be installed.
use_cache (bool): Set to True to enable caching during preprocessing. Always set this to True.
verbose (bool): Set to True to display a progress bar while processing.
file_ids (list): List of custom file IDs to assign to the documents.
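Putting the documented parameters together, a call might look like the following. The parameter names come from the list above; the values are illustrative examples, not recommended defaults.

```python
# Example keyword arguments for the Data constructor (values are examples only).
data_kwargs = dict(
    preprocessing_nlp_model="en_core_web_sm",  # fastest of the three models
    mini_batch_size=10000,       # long documents are chunked at this size
    use_gpu=True,                # only effective if GPU support libraries exist
    use_cache=True,              # recommended: always enable caching
    verbose=True,                # show a progress bar
    file_ids=["case_001", "case_002"],  # custom IDs for the two documents
)
# data = Data(texts_to_process, **data_kwargs)
```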