Elasticsearch is missing some documents when creating an index from another index
I have created an index in Elasticsearch named documents_local and loaded data into it from an Oracle database via Logstash. The index contains the following fields:
"FilePath" : "Path of the file",
"FileName" : "filename.pdf",
"Language" : "Language_name"
Next, I want to index the contents of those files as well, so I created a second index in Elasticsearch named document_attachment. Using the Java code below, I fetch FilePath and FileName from the documents_local index, use the file path to retrieve the file from my local drive, and index its contents using the ingest-attachment plugin's processor.
Please find my Java code below, where I am indexing the files.
    private final static String INDEX = "documents_local"; // Documents table with file path - source index
    private final static String ATTACHMENT = "document_attachment"; // Documents with attachment - destination index
    private final static String TYPE = "doc";

    public static void main(String[] args) throws IOException {
        RestHighLevelClient restHighLevelClient = null;
        Document doc = new Document();
        try {
            restHighLevelClient = new RestHighLevelClient(RestClient.builder(new HttpHost("localhost", 9200, "http"),
                    new HttpHost("localhost", 9201, "http")));
        } catch (Exception e) {
            System.out.println(e.getMessage());
        }

        // Fetching Id, FilePath & FileName from Document Index.
        SearchRequest searchRequest = new SearchRequest(INDEX);
        searchRequest.types(TYPE);
        SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
        QueryBuilder qb = QueryBuilders.matchAllQuery();
        searchSourceBuilder.query(qb);
        searchSourceBuilder.size(3000);
        searchRequest.source(searchSourceBuilder);
        SearchResponse searchResponse = null;
        try {
            searchResponse = restHighLevelClient.search(searchRequest);
        } catch (IOException e) {
            e.getLocalizedMessage();
        }

        SearchHit[] searchHits = searchResponse.getHits().getHits();
        long totalHits = searchResponse.getHits().totalHits;
        int line = 1;
        String docpath = null;
        Map<String, Object> jsonMap;
        for (SearchHit hit : searchHits) {
            String encodedfile = null;
            File file = null;
            Map<String, Object> sourceAsMap = hit.getSourceAsMap();
            doc.setId((int) sourceAsMap.get("id"));
            doc.setLanguage(sourceAsMap.get("language"));
            doc.setFilename(sourceAsMap.get("filename").toString());
            doc.setPath(sourceAsMap.get("path").toString());
            String filepath = doc.getPath().concat(doc.getFilename());
            System.out.println("Line Number--> " + line++ + "ID---> " + doc.getId() + "File Path --->" + filepath);
            file = new File(filepath);
            if (file.exists() && !file.isDirectory()) {
                try {
                    FileInputStream fileInputStreamReader = new FileInputStream(file);
                    byte[] bytes = new byte[(int) file.length()];
                    fileInputStreamReader.read(bytes);
                    encodedfile = new String(Base64.getEncoder().encodeToString(bytes));
                } catch (FileNotFoundException e) {
                    e.printStackTrace();
                }
            }
            jsonMap = new HashMap<>();
            jsonMap.put("id", doc.getId());
            jsonMap.put("language", doc.getLanguage());
            jsonMap.put("filename", doc.getFilename());
            jsonMap.put("path", doc.getPath());
            jsonMap.put("fileContent", encodedfile);
            String id = Long.toString(doc.getId());
            IndexRequest request = new IndexRequest(ATTACHMENT, "doc", id)
                    .source(jsonMap)
                    .setPipeline(ATTACHMENT);
            try {
                IndexResponse response = restHighLevelClient.index(request);
            } catch (ElasticsearchException e) {
                if (e.status() == RestStatus.CONFLICT) {
                }
                e.printStackTrace();
            }
        }
        System.out.println("Indexing done...");
    }
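One detail in the loop above that may be worth checking: a single `read(bytes)` call on a `FileInputStream` is not guaranteed to fill the whole buffer, and the stream is never closed. A minimal sketch of reading a complete file and Base64-encoding it with `java.nio.file.Files` follows; the names `FileEncoder` and `encodeOrNull` are made up for illustration and are not part of the code in the question.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Base64;

final class FileEncoder {
    private FileEncoder() {}

    /**
     * Returns the Base64 text of the file at the given path,
     * or null when the path does not point to a regular file.
     */
    static String encodeOrNull(Path path) throws IOException {
        if (!Files.isRegularFile(path)) {
            return null; // missing file or directory: nothing to attach
        }
        // readAllBytes reads the complete file and closes it afterwards
        byte[] bytes = Files.readAllBytes(path);
        return Base64.getEncoder().encodeToString(bytes);
    }
}
```

With something like this, the loop body could call `FileEncoder.encodeOrNull(Paths.get(filepath))` instead of managing the stream by hand.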
Please find below my mapping details for the ingest attachment plugin (I created the mapping first and then ran this Java code).
    PUT _ingest/pipeline/document_attachment
    {
      "description" : "Extract attachment information",
      "processors" : [
        {
          "attachment" : {
            "field" : "fileContent"
          }
        }
      ]
    }
But during this process I am missing some of the documents. My source index, documents_local, has 2910 documents. I fetch all of them, attach the PDF to each one (after converting it to Base64), and write the result to another index, document_attachment.
But the document_attachment index has only 118 documents where it should have 2910; the rest are missing. Indexing also takes a very long time.
I am not sure how the documents go missing in the second index (document_attachment). Is there any other way to improve this process?
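To narrow down which documents never arrived, one option is to compare the `id` values found in the two indices. The sketch below assumes the ids have already been collected into two collections (for example from the search hits of each index); `MissingIds` and `findMissing` are illustrative names, not Elasticsearch API.

```java
import java.util.Collection;
import java.util.Set;
import java.util.TreeSet;

final class MissingIds {
    private MissingIds() {}

    /**
     * Returns the ids present in the source collection but absent
     * from the destination collection, in ascending order.
     */
    static Set<Integer> findMissing(Collection<Integer> sourceIds, Collection<Integer> destinationIds) {
        Set<Integer> missing = new TreeSet<>(sourceIds); // copy, so the input is not modified
        missing.removeAll(destinationIds);
        return missing;
    }
}
```

Because the code in the question uses the `id` field as the document `_id` in the destination index, source documents that share the same `id` value would overwrite each other there, so it may also be worth checking how many distinct ids the source actually contains.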
Can we use threads here? In the future I will have to index more than 100k PDFs in the same way.
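A thread pool is one way to spread the per-file work across several threads. The following is only a sketch under assumptions: `indexOne` stands in for whatever reads, encodes and indexes a single file and reports success (it is a hypothetical callback, not a client method), and the pool size is an arbitrary choice.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Predicate;

final class ParallelIndexer {
    private ParallelIndexer() {}

    /**
     * Runs indexOne for every path on a fixed-size pool and
     * returns the paths whose task reported failure or threw.
     */
    static List<String> indexAll(List<String> paths, Predicate<String> indexOne, int threads)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Boolean>> results = new ArrayList<>();
            for (String path : paths) {
                Callable<Boolean> task = () -> indexOne.test(path);
                results.add(pool.submit(task));
            }
            List<String> failed = new ArrayList<>();
            for (int i = 0; i < paths.size(); i++) {
                try {
                    if (!results.get(i).get()) {
                        failed.add(paths.get(i)); // task ran but reported failure
                    }
                } catch (ExecutionException e) {
                    failed.add(paths.get(i)); // task threw an exception
                }
            }
            return failed;
        } finally {
            pool.shutdown();
        }
    }
}
```

Returning the failed paths makes silent losses visible, which matters here because the symptom is documents going missing without an obvious error. Sending documents in batches through the bulk API is another commonly used option and can be combined with this.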
@AndrePiantino I am using Elasticsearch version 6.2.3.
– Karthikeyan
2 days ago
@Val - I have attached my ingest attachment plugin mapping details. I created the mapping for my index document_attachment first, and then ran this Java code.
– Karthikeyan
2 days ago
And you don't see any stack traces when running your code? Any errors in the ES logs?
– Val
2 days ago
Some of the files are not available in the directory, so I wrote the code so that if a file is not available at the path, fileContent is ignored (kept as null) and the rest of the content is indexed. This is what I planned and is my approach, but it is not working that way.
– Karthikeyan
2 days ago
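The approach described in the comment above, indexing the metadata even when the file is missing, could be sketched as follows. The field names follow the code in the question; `SourceBuilder` and `build` are illustrative names. The sketch leaves `fileContent` out entirely when there is nothing to attach instead of storing a null. Whether the `attachment` processor then lets such a document through depends on the pipeline configuration; its `ignore_missing` option is an assumption to verify against the plugin documentation for the version in use.

```java
import java.util.HashMap;
import java.util.Map;

final class SourceBuilder {
    private SourceBuilder() {}

    /**
     * Builds the document source; fileContent is only added
     * when an encoded file is actually available.
     */
    static Map<String, Object> build(int id, String language, String filename,
                                     String path, String encodedFile) {
        Map<String, Object> source = new HashMap<>();
        source.put("id", id);
        source.put("language", language);
        source.put("filename", filename);
        source.put("path", path);
        if (encodedFile != null) {
            source.put("fileContent", encodedFile);
        }
        return source;
    }
}
```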
What is the elastic version?
– Andre Piantino
2 days ago