SINAN Metadata Information

[1]:

from pysus.online_data import SINAN

[2]:

metadata = SINAN.metadata_df('DENG')
metadata

/home/bida/Projetos/InfoDengue/PySUS/pysus/online_data/SINAN.py:50: FutureWarning: The error_bad_lines argument has been deprecated and will be removed in a future version. Use on_bad_lines in the future.


  df = pd.read_csv(

[2]:

	Nome do campo	Campo	Tipo	Categoria	Descrição	Características	DBF
0	2-Agravo/doença	NaN	VARCHAR(1)	1-Dengue\n2-Chikungunya	Especifica a suspeição do agravo.	Campo Obrigatório\nPreenchendo o código,\na de...	ID_AGRAVO
1	31-Data da Investigação	dt_investigacao	DATE	NaN	Informar a data da investigação	Campo obrigatório\n- Não aceitar Data da\ninve...	DT_INVEST
2	32- Ocupação/ramo de atividade econômica	co_cbo_ocupacao	VARCHAR(6)	NaN	Informar a atividade exercida pelo paciente no...	NaN	ID_OCUPA_N
3	33- Sinais clínicos (Febre)	NaN	VARCHAR(1)	1 – Sim\n2 – Não	Informar qual sinal clínico ou achados\nlabora...	Campo obrigatório	FEBRE
4	33- Sinais clínicos (Mialgia)	NaN	VARCHAR(1)	1 – Sim\n2 – Não	Informar qual sinal clínico ou achados\nlabora...	Campo obrigatório	MIALGIA
...	...	...	...	...	...	...	...
83	70- Dengue grave (Alteração da consciência)	NaN	VARCHAR(1)	1 – Sim\n2 – Não	Informar qual sinal de gravidade foi\nencontrado	Campo obrigatório\nSe campo 62-\nClassificação...	GRAV_CONSC
84	70- Dengue grave (Outros órgãos)	NaN	NaN	NaN	Informar qual sinal de gravidade foi\nencontrado	Campo obrigatório\nSe campo 62-\nClassificação...	GRAV_ORGAO
85	71- Dengue grave Data de início dos sintomas	NaN	DATE	NaN	Informar data de início dos sinais de alarme.\...	Campo obrigatório\nSe campo 62-\nClassificação...	DT_GRAV
86	Informações complementares e observações adici...	ds_observacao	VARCHAR(255)	NaN	Informações complementares e observações\nadic...	Habilitado para todas\nnotificações	DS_OBS
87	NaN	NaN	NaN	NaN	NaN	NaN	NaN

88 rows × 7 columns

[3]:

dict(zip(metadata['DBF'], metadata['Nome do campo']))

[3]:

{'ID_AGRAVO': '2-Agravo/doença',
 'DT_INVEST': '31-Data da Investigação',
 'ID_OCUPA_N': '32- Ocupação/ramo de atividade econômica',
 'FEBRE': '33- Sinais clínicos (Febre)',
 'MIALGIA': '33- Sinais clínicos (Mialgia)',
 'CEFALEIA': '33- Sinais clínicos (Cefaleia)',
 'EXANTEMA': '33- Sinais clínicos (Exantema)',
 'VOMITO': '33- Sinais clínicos (Vômito)',
 'NAUSEA': '33- Sinais clínicos (Náusea)',
 'DOR_COSTAS': '33- Sinais clínicos (Dor nas costas)',
 'CONJUNTVIT': '33- Sinais clínicos (Conjuntivite)',
 'ARTRITE': '33- Sinais clínicos (Artrite)',
 'ARTRALGIA': '33- Sinais clínicos (Artralgia intensa)',
 'PETÉQUIA_N': '33- Sinais clínicos (Petéquias)',
 'LEUCOPENIA': '33- Sinais clínicos (Leucopenia)',
 'LACO': '33- Sinais clínicos (Prova do laço)',
 'DOR_RETRO': '33- Sinais clínicos (Dor retroorbital)',
 'DIABETES': '34- Doenças pré- existentes (Diabetes)',
 'HEMATOLOG': '34- Doenças pré- existentes (Doenças hematológicas)',
 'HEPATOPAT': '34- Doenças pré- existentes (Hepatopatias)',
 'RENAL': '34- Doenças pré- existentes (Doença renal crônica)',
 'HIPERTENSA': '34- Doenças pré- existentes (Hipertensão arterial)',
 'ÁCIDO_PEPT': '34- Doenças pré- existentes (Doença ácido-péptica)',
 'AUTO_IMUNE': '34- Doenças pré- existentes (Doenças auto-imunes)',
 'DT_CHIK_S1': '35-Exame sorológico (IgM) Chikungunya Soro 1 Data da Coleta',
 'DT_CHIK_S2': '36-Exame sorológico (IgM) Chikungunya Soro 2 Data da Coleta',
 'DT_PRNT': '37-Exame PRNT Data da Coleta',
 'RES_CHIKS1': '38-Exame sorológico (IgM) Chikungunya Soro 1 Resultado',
 'RES_CHIKS2': '38-Exame sorológico (IgM) Chikungunya Soro 2 Resultado',
 'RESUL_PRNT': '38-Exame PRNT Resultado',
 'DT_SORO': '39-Exame sorológico (IgM) Dengue Data da Coleta',
 'RESUL_SORO': '40-Exame sorológico (IgM) Dengue Resultado',
 'DT_NS1': '41- Exame NS1 Data da Coleta',
 'RESUL_NS1': '42- Exame NS1 Resultado',
 'DT_VIRAL': '43-Isolamento Viral Data da Coleta',
 'RESUL_VI_N': '43-Isolamento Viral Resultado',
 'DT_PCR': '45-RT-PCR Data da Coleta',
 'RESUL_PCR_': '46-RT-PCR Resultado',
 'SOROTIPO': '47-Sorotipo',
 'HISTOPA_N': '48-Histopatologia Resultado',
 'IMUNOH_N': '49-Imunohistoquímica Resultado',
 'HOSPITALIZ': '50-Ocorreu Hospitalização?',
 'DT_INTERNA': '51-Data Da Internação',
 'UF': '52-UF De Hospitalização',
 'MUNICIPIO': '53-Municipio Do Hospital',
 'HOSPITAL': '54-Nome Do Hospital',
 'DDD_HOSP': '55-(DDD)',
 'TEL_HOSP': '55-Telefone',
 'TPAUTOCTO': '56-O caso é Autóctone de residência?',
 'COUFINF': '57-UF (provável da fonte de infecção)',
 'COPAISINF': '58-País (provável da fonte de infecção)',
 'COMUNINF': '59-Município (provável da fonte de infecção)',
 'CODISINF': '60-Distrito (provável de infecção)',
 'CO_BAINF\nNOBAIINF': '61-Bairro (provável de infecção)',
 'CLASSI_FIN': '62-Classificação',
 'CRITERIO': '63-Critério De Confirmação/Descarte',
 'CLINIC_CHIK': '64- Apresentação clínica Chikungunya',
 'EVOLUCAO': '65-Evolução Do Caso',
 'DT_OBITO': '66-Data do Óbito',
 'DT_ENCERRA': '67-Data Do Encerramento',
 'ALRM_HIPOT': '68- Dengue com sinais de alarme (Hipotensão)',
 'ALRM_PLAQ': '68- Dengue com sinais de alarme (Queda abrupta de plaquetas)',
 'ALRM_VOM': '68- Dengue com sinais de alarme (Vômitos persistentes)',
 'ALRM_ABDOM': '68- Dengue com sinais de alarme (Dor abdominal)',
 'ALRM_LETAR': '68- Dengue com sinais de alarme (Letargia ou irritabilidade)',
 'ALRM_SANG': '68- Dengue com sinais de alarme (Sangramento de mucosa/outras hemorragias)',
 'ALRM_HEMAT': '68- Dengue com sinais de alarme (Aumento hematócrito)',
 'ALRM_HEPAT': '68- Dengue com sinais de alarme (Hepatomegalia)',
 'ALRM_LIQ': '68- Dengue com sinais de alarme (Acúmulo de líquidos)',
 'DT_ALRM': '69- Dengue com sinais de alarme Data de início dos sintomas',
 'GRAV_PULSO': '70- Dengue grave (Pulso débil ou indectável)',
 'GRAV_CONV': '70- Dengue grave (PA convergente)',
 'GRAV_ENCH': '70- Dengue grave (Tempo de enchimento capilar)',
 'GRAV_INSUF': '70- Dengue grave (Acúmulo de líquidos com insuficiência respiratória)',
 'GRAV_TAQUI': '70- Dengue grave (Taquicardia)',
 'GRAV_EXTRE': '70- Dengue grave (Extremidade frias)',
 'GRAV_HIPOT': '70- Dengue grave (Hipotensão arterial em fase tardia)',
 'GRAV_HEMAT': '70- Dengue grave (Hematêmese)',
 'GRAV_MELEN': '70- Dengue grave (Melena)',
 'GRAV_METRO': '70- Dengue grave (Metrorragia volumosa)',
 'GRAV_SANG': '70- Dengue grave (Sangramento do SNC)',
 'GRAV_AST': '70- Dengue grave (AST/ALT >1 .000)',
 'GRAV_MIOC': '70- Dengue grave (Miocardite)',
 'GRAV_CONSC': '70- Dengue grave (Alteração da consciência)',
 'GRAV_ORGAO': '70- Dengue grave (Outros órgãos)',
 'DT_GRAV': '71- Dengue grave Data de início dos sintomas',
 'DS_OBS': 'Informações complementares e observações adicionais',
 nan: nan}

Downloading data from the SINAN database

[4]:

from pysus.ftp.databases.sinan import SINAN
import pandas as pd

sinan = SINAN().load()

SINAN is a database of reported cases of certain diseases that Brazilian law requires to be reported. Unfortunately the data available for free download, corresponds only to the investigated cases not the totality of the reported cases. Nevertheless it’s an interesting dataset.

To find out what are these diseases, we can use PySUS:

[5]:

sinan.diseases

[5]:

{'ACBI': 'Acidente de trabalho com material biológico',
 'ACGR': 'Acidente de trabalho',
 'ANIM': 'Acidente por Animais Peçonhentos',
 'ANTR': 'Atendimento Antirrabico',
 'BOTU': 'Botulismo',
 'CANC': 'Cancêr relacionado ao trabalho',
 'CHAG': 'Doença de Chagas Aguda',
 'CHIK': 'Febre de Chikungunya',
 'COLE': 'Cólera',
 'COQU': 'Coqueluche',
 'DENG': 'Dengue',
 'DERM': 'Dermatoses ocupacionais',
 'DIFT': 'Difteria',
 'ESQU': 'Esquistossomose',
 'EXAN': 'Doença exantemáticas',
 'FMAC': 'Febre Maculosa',
 'FTIF': 'Febre Tifóide',
 'HANS': 'Hanseníase',
 'HANT': 'Hantavirose',
 'HEPA': 'Hepatites Virais',
 'IEXO': 'Intoxicação Exógena',
 'INFL': 'Influenza Pandêmica',
 'LEIV': 'Leishmaniose Visceral',
 'LEPT': 'Leptospirose',
 'LERD': 'LER/Dort',
 'LTAN': 'Leishmaniose Tegumentar Americana',
 'MALA': 'Malária',
 'MENI': 'Meningite',
 'MENT': 'Transtornos mentais relacionados ao trabalho',
 'NTRA': 'Notificação de Tracoma',
 'PAIR': 'Perda auditiva por ruído relacionado ao trabalho',
 'PEST': 'Peste',
 'PFAN': 'Paralisia Flácida Aguda',
 'PNEU': 'Pneumoconioses realacionadas ao trabalho',
 'RAIV': 'Raiva',
 'SDTA': 'Surto Doenças Transmitidas por Alimentos',
 'SIFA': 'Sífilis Adquirida',
 'SIFC': 'Sífilis Congênita',
 'SIFG': 'Sífilis em Gestante',
 'SRC': 'Síndrome da Rubéola Congênia',
 'TETA': 'Tétano Acidental',
 'TETN': 'Tétano Neonatal',
 'TOXC': 'Toxoplasmose Congênita',
 'TOXG': 'Toxoplasmose Gestacional',
 'TRAC': 'Inquérito de Tracoma',
 'TUBE': 'Tuberculose',
 'VARC': 'Varicela',
 'VIOL': 'Violência doméstica, sexual e/ou outras violências',
 'ZIKA': 'Zika Vírus'}

These diseases are available in countrywide tables, so if we want to see the cases of Chagas disease in the state of Minas Gerais, first we can check which years are available:

[6]:

sinan.get_files('CHAG')

[6]:

[CHAGBR00.dbc,
 CHAGBR01.dbc,
 CHAGBR02.dbc,
 CHAGBR03.dbc,
 CHAGBR04.dbc,
 CHAGBR05.dbc,
 CHAGBR06.dbc,
 CHAGBR07.dbc,
 CHAGBR08.dbc,
 CHAGBR09.dbc,
 CHAGBR10.dbc,
 CHAGBR11.dbc,
 CHAGBR12.dbc,
 CHAGBR13.dbc,
 CHAGBR14.dbc,
 CHAGBR15.dbc,
 CHAGBR16.dbc,
 CHAGBR17.dbc,
 CHAGBR18.dbc,
 CHAGBR19.dbc,
 CHAGBR20.dbc,
 CHAGBR21.dbc]

We can also check when it was last updated for every disease, and if the table is preliminary or final.

[7]:

from pysus.online_data import FTP_Inspect
lu = FTP_Inspect('SINAN').last_update_df()
lu

[7]:

	folder	date	file_size	file_name
0	/dissemin/publicos/SINAN/DADOS/FINAIS	2023-01-16 14:15:00	28326	ACBIBR06.dbc
1	/dissemin/publicos/SINAN/DADOS/FINAIS	2023-01-16 14:15:00	673314	ACBIBR07.dbc
2	/dissemin/publicos/SINAN/DADOS/FINAIS	2023-01-16 14:15:00	1048406	ACBIBR08.dbc
3	/dissemin/publicos/SINAN/DADOS/FINAIS	2023-01-16 14:15:00	1493392	ACBIBR09.dbc
4	/dissemin/publicos/SINAN/DADOS/FINAIS	2023-01-16 14:15:00	1632311	ACBIBR10.dbc
...	...	...	...	...
753	/dissemin/publicos/SINAN/DADOS/PRELIM	2023-03-09 16:37:00	169214	VARCBR22.dbc
754	/dissemin/publicos/SINAN/DADOS/PRELIM	2023-08-10 10:28:00	127626	VARCBR23.dbc
755	/dissemin/publicos/SINAN/DADOS/PRELIM	2023-06-26 11:25:00	26007974	VIOLBR21.dbc
756	/dissemin/publicos/SINAN/DADOS/PRELIM	2023-06-26 11:25:00	32269105	VIOLBR22.dbc
757	/dissemin/publicos/SINAN/DADOS/PRELIM	2023-08-23 10:11:00	384075	ZIKABR23.dbc

758 rows × 4 columns

[8]:

lu[lu.file_name.str.startswith('CHAG')]

[8]:

	folder	date	file_size	file_name
80	/dissemin/publicos/SINAN/DADOS/FINAIS	2022-03-28 11:18:00	41075	CHAGBR00.dbc
81	/dissemin/publicos/SINAN/DADOS/FINAIS	2022-03-28 11:18:00	47675	CHAGBR01.dbc
82	/dissemin/publicos/SINAN/DADOS/FINAIS	2022-03-28 11:18:00	69415	CHAGBR02.dbc
83	/dissemin/publicos/SINAN/DADOS/FINAIS	2022-03-28 11:18:00	90539	CHAGBR03.dbc
84	/dissemin/publicos/SINAN/DADOS/FINAIS	2022-03-28 11:18:00	86820	CHAGBR04.dbc
85	/dissemin/publicos/SINAN/DADOS/FINAIS	2022-03-28 11:18:00	223289	CHAGBR05.dbc
86	/dissemin/publicos/SINAN/DADOS/FINAIS	2022-03-28 11:18:00	135953	CHAGBR06.dbc
87	/dissemin/publicos/SINAN/DADOS/FINAIS	2021-11-23 12:21:00	11660	CHAGBR07.dbc
88	/dissemin/publicos/SINAN/DADOS/FINAIS	2021-11-23 12:21:00	11004	CHAGBR08.dbc
89	/dissemin/publicos/SINAN/DADOS/FINAIS	2021-11-23 12:21:00	17913	CHAGBR09.dbc
90	/dissemin/publicos/SINAN/DADOS/FINAIS	2021-11-23 12:21:00	13470	CHAGBR10.dbc
91	/dissemin/publicos/SINAN/DADOS/FINAIS	2021-11-23 12:21:00	17109	CHAGBR11.dbc
92	/dissemin/publicos/SINAN/DADOS/FINAIS	2021-11-23 12:21:00	264167	CHAGBR12.dbc
93	/dissemin/publicos/SINAN/DADOS/FINAIS	2021-11-23 12:21:00	245977	CHAGBR13.dbc
94	/dissemin/publicos/SINAN/DADOS/FINAIS	2021-11-23 12:21:00	261115	CHAGBR14.dbc
95	/dissemin/publicos/SINAN/DADOS/FINAIS	2021-11-23 12:21:00	188615	CHAGBR15.dbc
96	/dissemin/publicos/SINAN/DADOS/FINAIS	2022-03-28 11:18:00	244387	CHAGBR16.dbc
97	/dissemin/publicos/SINAN/DADOS/FINAIS	2021-11-23 12:21:00	215819	CHAGBR17.dbc
98	/dissemin/publicos/SINAN/DADOS/FINAIS	2022-03-28 11:18:00	254336	CHAGBR18.dbc
99	/dissemin/publicos/SINAN/DADOS/FINAIS	2021-11-23 12:21:00	285962	CHAGBR19.dbc
536	/dissemin/publicos/SINAN/DADOS/PRELIM	2022-02-09 15:16:00	189312	CHAGBR20.dbc
537	/dissemin/publicos/SINAN/DADOS/PRELIM	2023-04-20 12:17:00	282009	CHAGBR21.dbc

We can see, that we have data in final form, from 2000 until 2019, and preliminary data for 2020. Now we can download it:

[9]:

df = sinan.download(sinan.get_files('CHAG', 2019)).to_dataframe()
df

CHAGBR19.parquet: 100%|████████████████████████████████████████████████████████████████████████████████▉| 8.44k/8.44k [00:00<00:00, 19.4kB/s]

[9]:

	TP_NOT	ID_AGRAVO	DT_NOTIFIC	SEM_NOT	NU_ANO	SG_UF_NOT	ID_MUNICIP	ID_REGIONA	ID_UNIDADE	DT_SIN_PRI	...	DT_OBITO	CON_PROVAV	CON_OUTRA	CON_LOCAL	TPAUTOCTO	COUFINF	COPAISINF	COMUNINF	DOENCA_TRA	DT_ENCERRA
0	2	B571	2019-04-10	201915	2019	16	160030		2019639	2019-03-01	...		5		2	1	16	1	160030	2	20190513
1	2	B571	2019-09-16	201938	2019	16	160030		2022192	2019-08-18	...		5		2	2	16	1	160060	2	20191002
2	2	B571	2019-03-07	201910	2019	16	160030		2022192	2019-02-28	...		5		2	1	16	1	160030		20190325
3	2	B571	2019-10-22	201943	2019	16	160030		2020653	2019-09-09	...		5		2	1	16	1	160030	2	20191107
4	2	B571	2019-09-10	201937	2019	16	160060		2020971	2019-08-28	...		5		2	1	16	1	160060	2	20191108
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
4477	2	B571	2019-09-05	201936	2019	26	260120	1501	5740592	2019-09-04	...							0			20191022
4478	2	B571	2019-09-26	201939	2019	26	261390	1506	2348489	2019-08-26	...		2		2	1	26	1	261390	2	20191008
4479	2	B571	2019-01-17	201903	2019	26	260120	1501	2711443	2019-01-05	...							0			20190212
4480	2	B571	2019-07-03	201927	2019	26	260820	1498	5276403	2019-07-03	...							0			20190903
4481	2	B571	2019-11-08	201945	2019	26	261220	1502	5844916	2019-08-04	...							0			20200124

4482 rows × 99 columns

Downloading large files

Some SINAN files can be quite large and can take a bit longer to download and convert. As the default behavior is to download data in chunks, some folders may contain lots of parquet chunks

[10]:

fn = sinan.download(sinan.get_files('DENG', 2020))

DENGBR20.parquet: 100%|█████████████████████████████████████████████████████████████████████████████████| 2.40M/2.40M [02:24<00:00, 16.6kB/s]

The cases of dengue where downloaded to multiple chunks to the directory above

[11]:

import os
len(os.listdir(str(fn)))

[11]:

[12]:

df2 = fn.to_dataframe()
len(df2)

[12]:

Decoding the age in SINAN tables

In SINAN the age comes encoded. PySUS can decode the age column NU_IDADE_N into any of these units: years, months, days, or hours.

[13]:

from glob import glob
for i, f in enumerate(glob(f"{fn}/*.parquet")):
    if i == 0:
        df2 = pd.read_parquet(f)
    else:
        df2 = pd.concat([df2, pd.read_parquet(f)], ignore_index=True)
df2

[13]:

	TP_NOT	ID_AGRAVO	DT_NOTIFIC	SEM_NOT	NU_ANO	SG_UF_NOT	ID_MUNICIP	ID_REGIONA	ID_UNIDADE	DT_SIN_PRI	...	LACO_N	PLASMATICO	EVIDENCIA	PLAQ_MENOR	CON_FHD	COMPLICA	TP_SISTEMA	NDUPLIC_N	CS_FLXRET	FLXRECEBI
0	2	A90	20200424		2020	35	355410	1352	2046458	20200420	...							2		1
1	2	A90	20200430		2020	35	351110	1354	7340885	20200428	...							2		0
2	2	A90	20200331		2020	35	351440	1573	7959222	20200328	...							2		0
3	2	A90	20200421		2020	35	352270	1337	2079836	20200421	...							2		0
4	2	A90	20200516		2020	35	350280	1336	2059320	20200511	...							2		1
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
1495112	2	A90	20200331		2020	31	310620	1449	0023310	20200330	...							2		1
1495113	2	A90	20200416		2020	31	313240	1459	3147932	20200407	...							2		0
1495114	2	A90	20200414		2020	31	310620	1449	0027995	20200413	...							2		1
1495115	2	A90	20200214		2020	31	311660	1472	2160307	20200211	...							2		1
1495116	2	A90	20200316		2020	31	310620	1449	9653848	20200316	...							2		1

1495117 rows × 119 columns

Decoding the age in SINAN tables

In SINAN the age comes encoded. PySUS can decode the age column NU_IDADE_N into any of these units: years, months, days, or hours.

[14]:

from pysus.preprocessing.decoders import decodifica_idade_SINAN
decodifica_idade_SINAN?

Signature:       decodifica_idade_SINAN(*args, **kwargs)
Type:            vectorize
String form:     <numpy.vectorize object at 0x7f89472dff50>
File:            ~/micromamba/envs/pysus/lib/python3.11/site-packages/numpy/__init__.py
Docstring:
Em tabelas do SINAN frequentemente a idade é representada como um inteiro que precisa ser parseado
para retornar a idade em uma unidade cronológica padrão.
:param unidade: unidade da idade: 'Y': anos, 'M' meses, 'D': dias, 'H': horas
:param idade: inteiro ou sequencia de inteiros codificados.
:return:
Class docstring:
vectorize(pyfunc, otypes=None, doc=None, excluded=None, cache=False,
          signature=None)

Generalized function class.

Define a vectorized function which takes a nested sequence of objects or
numpy arrays as inputs and returns a single numpy array or a tuple of numpy
arrays. The vectorized function evaluates `pyfunc` over successive tuples
of the input arrays like the python map function, except it uses the
broadcasting rules of numpy.

The data type of the output of `vectorized` is determined by calling
the function with the first element of the input.  This can be avoided
by specifying the `otypes` argument.

Parameters
----------
pyfunc : callable
    A python function or method.
otypes : str or list of dtypes, optional
    The output data type. It must be specified as either a string of
    typecode characters or a list of data type specifiers. There should
    be one data type specifier for each output.
doc : str, optional
    The docstring for the function. If None, the docstring will be the
    ``pyfunc.__doc__``.
excluded : set, optional
    Set of strings or integers representing the positional or keyword
    arguments for which the function will not be vectorized.  These will be
    passed directly to `pyfunc` unmodified.

    .. versionadded:: 1.7.0

cache : bool, optional
    If `True`, then cache the first function call that determines the number
    of outputs if `otypes` is not provided.

    .. versionadded:: 1.7.0

signature : string, optional
    Generalized universal function signature, e.g., ``(m,n),(n)->(m)`` for
    vectorized matrix-vector multiplication. If provided, ``pyfunc`` will
    be called with (and expected to return) arrays with shapes given by the
    size of corresponding core dimensions. By default, ``pyfunc`` is
    assumed to take scalars as input and output.

    .. versionadded:: 1.12.0

Returns
-------
vectorized : callable
    Vectorized function.

See Also
--------
frompyfunc : Takes an arbitrary Python function and returns a ufunc

Notes
-----
The `vectorize` function is provided primarily for convenience, not for
performance. The implementation is essentially a for loop.

If `otypes` is not specified, then a call to the function with the
first argument will be used to determine the number of outputs.  The
results of this call will be cached if `cache` is `True` to prevent
calling the function twice.  However, to implement the cache, the
original function must be wrapped which will slow down subsequent
calls, so only do this if your function is expensive.

The new keyword argument interface and `excluded` argument support
further degrades performance.

References
----------
.. [1] :doc:`/reference/c-api/generalized-ufuncs`

Examples
--------
>>> def myfunc(a, b):
...     "Return a-b if a>b, otherwise return a+b"
...     if a > b:
...         return a - b
...     else:
...         return a + b

>>> vfunc = np.vectorize(myfunc)
>>> vfunc([1, 2, 3, 4], 2)
array([3, 4, 1, 2])

The docstring is taken from the input function to `vectorize` unless it
is specified:

>>> vfunc.__doc__
'Return a-b if a>b, otherwise return a+b'
>>> vfunc = np.vectorize(myfunc, doc='Vectorized `myfunc`')
>>> vfunc.__doc__
'Vectorized `myfunc`'

The output type is determined by evaluating the first element of the input,
unless it is specified:

>>> out = vfunc([1, 2, 3, 4], 2)
>>> type(out[0])
<class 'numpy.int64'>
>>> vfunc = np.vectorize(myfunc, otypes=[float])
>>> out = vfunc([1, 2, 3, 4], 2)
>>> type(out[0])
<class 'numpy.float64'>

The `excluded` argument can be used to prevent vectorizing over certain
arguments.  This can be useful for array-like arguments of a fixed length
such as the coefficients for a polynomial as in `polyval`:

>>> def mypolyval(p, x):
...     _p = list(p)
...     res = _p.pop(0)
...     while _p:
...         res = res*x + _p.pop(0)
...     return res
>>> vpolyval = np.vectorize(mypolyval, excluded=['p'])
>>> vpolyval(p=[1, 2, 3], x=[0, 1])
array([3, 6])

Positional arguments may also be excluded by specifying their position:

>>> vpolyval.excluded.add(0)
>>> vpolyval([1, 2, 3], x=[0, 1])
array([3, 6])

The `signature` argument allows for vectorizing functions that act on
non-scalar arrays of fixed length. For example, you can use it for a
vectorized calculation of Pearson correlation coefficient and its p-value:

>>> import scipy.stats
>>> pearsonr = np.vectorize(scipy.stats.pearsonr,
...                 signature='(n),(n)->(),()')
>>> pearsonr([[0, 1, 2, 3]], [[1, 2, 3, 4], [4, 3, 2, 1]])
(array([ 1., -1.]), array([ 0.,  0.]))

Or for a vectorized convolution:

>>> convolve = np.vectorize(np.convolve, signature='(n),(m)->(k)')
>>> convolve(np.eye(4), [1, 2, 1])
array([[1., 2., 1., 0., 0., 0.],
       [0., 1., 2., 1., 0., 0.],
       [0., 0., 1., 2., 1., 0.],
       [0., 0., 0., 1., 2., 1.]])
Call docstring:
Return arrays with the results of `pyfunc` broadcast (vectorized) over
`args` and `kwargs` not in `excluded`.

We can easily convert dates and numerical fields in the dataframe:

[15]:

for cname in df.columns:
    if cname.startswith('DT_'):
        df[cname] = pd.to_datetime(df[cname], errors='coerce')
    elif cname.startswith('ID_'):
        try:
            df[cname] = pd.to_numeric(df[cname])
        except ValueError:
            continue
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4482 entries, 0 to 4481
Data columns (total 99 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   TP_NOT      4482 non-null   string
 1   ID_AGRAVO   4482 non-null   string
 2   DT_NOTIFIC  4482 non-null   datetime64[ns]
 3   SEM_NOT     4482 non-null   string
 4   NU_ANO      4482 non-null   string
 5   SG_UF_NOT   4482 non-null   string
 6   ID_MUNICIP  4482 non-null   int64
 7   ID_REGIONA  3633 non-null   float64
 8   ID_UNIDADE  4482 non-null   int64
 9   DT_SIN_PRI  4482 non-null   datetime64[ns]
 10  SEM_PRI     4482 non-null   string
 11  DT_NASC     4447 non-null   datetime64[ns]
 12  NU_IDADE_N  4482 non-null   string
 13  CS_SEXO     4482 non-null   string
 14  CS_GESTANT  4482 non-null   string
 15  CS_RACA     4482 non-null   string
 16  CS_ESCOL_N  4482 non-null   string
 17  SG_UF       4482 non-null   string
 18  ID_MN_RESI  4482 non-null   int64
 19  ID_RG_RESI  3682 non-null   float64
 20  ID_PAIS     4482 non-null   int64
 21  NDUPLIC_N   4482 non-null   string
 22  DT_INVEST   4329 non-null   datetime64[ns]
 23  ID_OCUPA_N  4482 non-null   string
 24  ANT_UF_1    4482 non-null   string
 25  MUN_1       4482 non-null   string
 26  ANT_UF_2    4482 non-null   string
 27  MUN_2       4482 non-null   string
 28  ANT_UF_3    4482 non-null   string
 29  MUN_3       4482 non-null   string
 30  PRESENCA    4482 non-null   string
 31  PARASITO    4482 non-null   string
 32  HISTORIA    4482 non-null   string
 33  CONTROLE    4482 non-null   string
 34  MANIPULA    4482 non-null   string
 35  MAECHAGA    4482 non-null   string
 36  ORAL        4482 non-null   string
 37  ASSINTOMA   4482 non-null   string
 38  EDEMA       4482 non-null   string
 39  MENINGOE    4482 non-null   string
 40  POLIADENO   4482 non-null   string
 41  FEBRE       4482 non-null   string
 42  HEPATOME    4482 non-null   string
 43  SINAIS_ICC  4482 non-null   string
 44  ARRITMIAS   4482 non-null   string
 45  ASTENIA     4482 non-null   string
 46  ESPLENOM    4482 non-null   string
 47  CHAGOMA     4482 non-null   string
 48  OUTRO_SIN   4482 non-null   string
 49  OUTRO_ESP   4482 non-null   string
 50  DT_COL_DIR  2693 non-null   datetime64[ns]
 51  EXAME       4482 non-null   string
 52  MICRO_HEMA  4482 non-null   string
 53  OUTRO       4482 non-null   string
 54  DT_COL_IND  155 non-null    datetime64[ns]
 55  XENODIAG    4482 non-null   string
 56  HEMOCULT    4482 non-null   string
 57  DT_COL_S1   3300 non-null   datetime64[ns]
 58  DT_COL_S2   790 non-null    datetime64[ns]
 59  ELI_IGM_S1  4482 non-null   string
 60  ELI_IGG_S1  4482 non-null   string
 61  ELI_IGM_S2  4482 non-null   string
 62  ELI_IGG_S2  4482 non-null   string
 63  HEM_IGM_S1  4482 non-null   string
 64  HEM_IGG_S1  4482 non-null   string
 65  HEM_IGM_S2  4482 non-null   string
 66  HEM_IGG_S2  4482 non-null   string
 67  IMU_IGM_S1  4482 non-null   string
 68  TIT_IGM_S1  4482 non-null   string
 69  IMU_IGM_S2  4482 non-null   string
 70  TIT_IGM_S2  4482 non-null   string
 71  IMU_IGG_S1  4482 non-null   string
 72  TIT_IGG_S1  4482 non-null   string
 73  IMU_IGG_S2  4482 non-null   string
 74  TIT_IGG_S2  4482 non-null   string
 75  RESUL_HIS   4482 non-null   string
 76  RES_HIST    4482 non-null   string
 77  ESPECIFICO  4482 non-null   string
 78  SINTOMATIC  4482 non-null   string
 79  DROGA       4482 non-null   string
 80  TEMPO       4482 non-null   string
 81  CON_TRIAT   4482 non-null   string
 82  BIOSSEG     4482 non-null   string
 83  FISCALIZA   4482 non-null   string
 84  MED_OUTRO   4482 non-null   string
 85  OUTRO_DES   4482 non-null   string
 86  CLASSI_FIN  4482 non-null   string
 87  CRITERIO    4482 non-null   string
 88  EVOLUCAO    4482 non-null   string
 89  DT_OBITO    54 non-null     datetime64[ns]
 90  CON_PROVAV  4482 non-null   string
 91  CON_OUTRA   4482 non-null   string
 92  CON_LOCAL   4482 non-null   string
 93  TPAUTOCTO   4482 non-null   string
 94  COUFINF     4482 non-null   string
 95  COPAISINF   4482 non-null   string
 96  COMUNINF    4482 non-null   string
 97  DOENCA_TRA  4482 non-null   string
 98  DT_ENCERRA  4366 non-null   datetime64[ns]
dtypes: datetime64[ns](10), float64(2), int64(4), string(83)
memory usage: 3.4 MB

Let’s convert the age to years and save it on a different column.

[16]:

df['idade_anos'] = decodifica_idade_SINAN(df.NU_IDADE_N, 'Y')
df[['NU_IDADE_N', 'idade_anos']]

[16]:

	NU_IDADE_N	idade_anos
0	4013	13.0
1	4054	54.0
2	4031	31.0
3	4063	63.0
4	4036	36.0
...	...	...
4477	4079	79.0
4478	4063	63.0
4479	4039	39.0
4480	4036	36.0
4481	4052	52.0

4482 rows × 2 columns

Saving the Modified data

We can save our dataframe in any format we wish to avoid having to redo this analysis next time. If we want to keep only the data from the state of Minas Gerais we need to filter the table using the UF code 31.

[17]:

df['SG_UF_NOT'] = df.SG_UF_NOT.astype(int)
df[df.SG_UF_NOT==31].to_csv('chagas_SP_2018_mod.csv',sep=';',compression='zip')